Using Regression Model Analysis for Forecasting the Likelihood of Particular Symptoms of COVID-19

Traditional expert systems (TES) in the medical field frequently use a rule-based certainty factor (CF) technique to combine evidence from several symptoms and reach an inference. The primary concern for such a TES is predicting the likelihood of a particular ailment for new patients. The CF is estimated from symptoms connected to the clinical indicators in a patient's diagnosis. A TES is unlikely to be able to forecast unknown quantities, such as the likelihood of a particular ailment in an unseen case. Supervised learning techniques such as linear regression can address this issue. We analyzed the current COVID-19 TES by modeling a regression equation that forecasts the chance of a particular COVID-like disease from the CF value and the confidence level of the symptoms. To identify the most effective regression model, we employed multi-linear regression (MLR) and multi-polynomial regression (MPR). The findings show that both the MLR and MPR models are accurate regression models for estimating the chance of a disease associated with COVID-like symptoms. Our work lays a basis for the creation of expert systems that concentrate more on the analytical techniques of machine learning expert systems (MLES) than on those of TES.


Introduction
The emergence of the novel coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, has ignited a global health crisis of unprecedented proportions [1]-[4]. As this pandemic continues to impact societies worldwide, effective management and mitigation strategies rely heavily on accurate and timely forecasting of disease spread and symptom likelihood [5], [6]. One of the crucial aspects in understanding and combating the spread of COVID-19 is the identification of specific symptoms that indicate the presence of the virus in individuals [7], [8].
In recent times, epidemiology and medical research have witnessed an accelerated utilization of data-driven approaches to comprehend and predict disease transmission dynamics [9], [10]. Leveraging the power of advanced statistical techniques and machine learning, researchers have made significant strides in modeling the spread of COVID-19 at a population level [11], [12]. However, there remains a compelling need to focus on predicting individual-level manifestations of the disease, especially the likelihood of experiencing particular symptoms associated with COVID-19.
Expert systems, in turn, aim to help medical experts solve patients' problems [13], [14]. They have been adopted in several areas of human medicine, including internal medicine [15], [16], psychology [17]-[20], COVID-19 [21], [22], cancer [23]-[26], and traditional medicine [27], [28]. Expert systems address human health tasks such as treatment, forecasting, and diagnosis [29]. Al Hakim et al. [21] and Suprayitno et al. [30] developed a smartphone-based COVID-19 expert system in response to the pandemic. Unfortunately, this existing expert system has not been evaluated for its performance or for forecasting future cases (new patients with new symptoms). This study therefore evaluates the performance of the expert system through regression modeling analysis using supervised learning (SL).
This research paper aims to contribute to the ongoing efforts in forecasting COVID-19 by employing regression model analysis to predict the probability of individuals exhibiting specific symptoms. Using a comprehensive dataset of confirmed COVID-19 cases and symptom profiles, we intend to construct predictive models that offer insight into the likelihood of particular symptoms manifesting in infected individuals. The results of this study have potential implications for public health interventions, clinical resource allocation, and personalized patient care. This research therefore focuses on the following questions:
1. What are the results of the statistical analysis of the predictions in this study?
2. How can the likelihood of a specific disease with COVID-19 symptoms (COVID-like symptoms) be predicted using supervised machine learning based on the results of the certainty factor method?
3. If the prediction model is effective, should it be applied to support new or related future symptoms of COVID-19 variants?
This paper analyzes a regression model for predicting (forecasting or estimating) COVID-19-related symptoms, i.e., the probability of a specific disease with symptoms similar to COVID-19. With a regression model, this study attempts to model the likelihood of particular COVID-19 symptoms in the future based on collected symptoms and to improve the prediction performance of the expert system. This paper supports the role of health information in expert system functions and is timely, as machine learning (ML) is crucial to predicting future risks of COVID-19-related symptoms.
This paper contributes to improving the performance of traditional expert systems that use the certainty factor method, and it provides an opportunity to use intelligent certainty factor calculation methods to predict the likelihood of specific diseases in cases of COVID-19. Health informaticians, biomedical engineers, and computer scientists all have a role in improving the performance of traditional expert system methods, including certainty factors.
Traditional expert systems (TES), such as those based on the CF method, are used to quantify uncertainty, drawing on knowledge from human experts (CF rules) and from users (CF users). One of the most popular algorithms for calculating CF values is MYCIN [66]. This rule-based TES algorithm calculates final CF values that represent coded symptoms [20], [21], [67], [68] or datasets of symptoms related to a specific disease [69]-[71]. Equation (1) gives the CF value calculation [72].
Based on Equation (1), the certainty factor (CF) of a symptom ranges from 0.00 (untrue) to 1.00 (true). The rule scheme also allows conditional explanations of CFs to be included. When a rule's premise is uncertain because of uncertain symptoms, and its conclusion is uncertain because of the rule specification, the CF calculation must combine the CFrule (RuleCF) and the CFuser (PremiseCF). The final CF value, CFcombine, determines the confidence level [46].
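The CF calculation described above can be illustrated with a minimal sketch. It assumes the commonly cited MYCIN form for positive evidence, in which the CF of a single symptom is CFrule multiplied by CFuser and successive CFs are merged as CFold + CFnew (1 - CFold); the exact form of Equation (1) may differ. The function names and the symptom values are hypothetical and serve only as an illustration.

```python
def cf_single(cf_rule: float, cf_user: float) -> float:
    """CF of one symptom: expert weight (CFrule) times patient confidence (CFuser)."""
    return cf_rule * cf_user


def cf_combine(cf_values):
    """Merge positive CFs one at a time (assumed MYCIN rule for positive evidence)."""
    combined = 0.0
    for cf in cf_values:
        if not 0.0 <= cf <= 1.0:
            raise ValueError("CF values must lie in [0, 1]")
        # CF_old + CF_new * (1 - CF_old)
        combined = combined + cf * (1.0 - combined)
    return combined


# Hypothetical coded symptoms as (CFrule, CFuser) pairs; not taken from the study.
symptoms = [(0.8, 0.6), (0.6, 0.4), (0.4, 1.0)]
per_symptom = [cf_single(rule, user) for rule, user in symptoms]
cf_final = cf_combine(per_symptom)
print(f"CFcombine = {cf_final:.4f}")
```

Under this assumed rule the combined value never exceeds 1.00 and never decreases as further positive symptoms are added.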
Today, in the era of artificial intelligence (AI), expert systems are being developed with machine learning algorithms [73], [74]. Supervised learning (SL) is a branch of machine learning in which an algorithm is trained on labelled information [74], [75]. Linear regression is the most common supervised learning method and a widely used statistical procedure [76]. Supervised learning tries to resolve uncertainty [77], which is commonly the main issue for expert systems, including in medicine [78]. Supervised learning methods build hypotheses from existing data instances in order to predict future data. A set of labelled cases is fed into a supervised learning algorithm, which builds a model for categorizing or anticipating future events [79].
Research on regression methods generally focuses on selecting the best regression model, which determines the predictors or variables included in the regression equation. The purpose of selecting the best regression model is usually forecasting or prediction [80]. Regression analysis types such as multi-linear regression (MLR) and multiple polynomial regression (MPR) are commonly used to model predictions from existing datasets. MLR focuses on incorporating each predictor together with the response variable, whereas MPR does not consider the involvement between the predictor and response variables [76].
As intelligent tools that help experts solve problems, expert systems must be evaluated for performance, including prediction performance. Such evaluation is necessary for predicting future cases [79] from knowledge-based representation datasets. Medical expert systems in particular need improved system performance, for which regression analysis can be used [79], [81].
The existing COVID-19 expert systems developed by Al Hakim et al. [21] and Suprayitno et al. [30] have limited performance, especially in estimating the likelihood of a particular disease for new symptoms and COVID-like symptoms. In addition, every confirmed patient has a different physiological condition, so patients are likely to show different clinical signs. The prediction performance of the existing expert system must therefore be evaluated for future forecasting cases. This study proposes two regression analyses, multi-linear regression (MLR) and multiple polynomial regression (MPR), to obtain a better forecasting model, and it tests the following hypotheses:
1. H1: the MLR regression model is the best forecasting model (if p-value < 0.05, one-tailed);
2. H2: the MPR regression model is the best forecasting model (if p-value < 0.05, one-tailed).
Both hypotheses may be accepted when proposing the best forecasting model; this study therefore also seeks the best-fit model based on the R-squared, F-value, and p-value.
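The fit criteria named above can be computed directly from observed and fitted values. The following is a minimal sketch, assuming an ordinary least squares fit with an intercept (so that the regression sum of squares equals the total minus the residual sum of squares); the function name and the sample values are hypothetical. The p-value would then follow from the F distribution with (k, n - k - 1) degrees of freedom, for example via scipy.stats.f.sf, which is omitted here to keep the sketch dependency-free.

```python
def fit_statistics(y, y_hat, n_predictors):
    """Return (R-squared, adjusted R-squared, F-value) for a fitted regression.

    Assumes an OLS fit with an intercept and a non-zero residual sum of squares.
    """
    n = len(y)
    k = n_predictors
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    f_value = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))
    return r2, adj_r2, f_value


# Hypothetical observed and fitted values with one predictor; not the study's data.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.9, 4.9]
r2, adj_r2, f_value = fit_statistics(observed, fitted, n_predictors=1)
print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, F = {f_value:.2f}")
```

The adjusted R-squared penalizes additional predictors, which is why it is the more cautious basis for comparing models with different numbers of terms.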

Collection, Acquisition, and Analysis of the Dataset
The study adopted the COVID-19 symptom data of a previous study [21], which was further developed into another mobile app version by Suprayitno et al. [30]. This allowed the input features to be identified, namely the certainty factor (CF) value based on the collected symptoms and the confidence level (CL), for decimal-valued prediction of a specific disease from COVID-like symptom information, and it allowed the effect of the input features on the target variable to be predicted. The input features were studied using multi-linear regression (MLR) and multiple polynomial regression (MPR), which are supervised machine learning algorithms used for regression tasks [29], [76], [82]. According to a related study [83], MLR used CFrule (RuleCF) and CFuser (PremiseCF) as predictor variables (Xn) and the confidence level (CL) as the response variable (Y1), whereas MPR did not include any predictor variables for the CL variable (Y2). Anticipating future CF values is crucial, especially for new patients; the dataset was therefore used to obtain the best prediction from the CFrule and CFuser variables in conjunction with the CL variable, which the regression equation can calculate. RStudio [84] was used in this study to solve the MLR equation, Equation (2), and the MPR equation, Equation (3) [85].
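As an illustration of how the two model types could be fitted, the sketch below uses ordinary least squares in Python with NumPy rather than the RStudio workflow of the study. Everything in it is an assumption made for illustration: the data are synthetic, and the MPR is taken to be a second-degree polynomial in CFrule and CFuser without an interaction term, which may not match the exact form of Equation (3).

```python
import numpy as np

# Synthetic stand-in for the symptom dataset (NOT the study's data).
rng = np.random.default_rng(0)
cf_rule = rng.uniform(0.0, 1.0, size=40)
cf_user = rng.uniform(0.0, 1.0, size=40)
cl = (0.1 + 0.5 * cf_rule + 0.3 * cf_user + 0.2 * cf_rule**2
      + rng.normal(0.0, 0.02, size=40))


def fit_least_squares(design, response):
    """Return OLS coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    residuals = response - design @ beta
    return beta, float(residuals @ residuals)


ones = np.ones_like(cf_rule)
# MLR: CL = b0 + b1*CFrule + b2*CFuser
X_mlr = np.column_stack([ones, cf_rule, cf_user])
# MPR (assumed form): adds squared terms, no CFrule*CFuser interaction
X_mpr = np.column_stack([ones, cf_rule, cf_user, cf_rule**2, cf_user**2])

beta_mlr, rss_mlr = fit_least_squares(X_mlr, cl)
beta_mpr, rss_mpr = fit_least_squares(X_mpr, cl)
print("MLR coefficients:", np.round(beta_mlr, 4), "RSS:", round(rss_mlr, 6))
print("MPR coefficients:", np.round(beta_mpr, 4), "RSS:", round(rss_mpr, 6))
```

Because the MLR design matrix is nested inside the MPR one, the MPR residual sum of squares can never exceed that of the MLR on the training data, so a fair comparison should also use a criterion that accounts for the extra terms, such as the adjusted R-squared.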

Expert Knowledge Representation
A medical doctor (Aviasenna Andriand, MD) and supporting medical references served as the source of the knowledge-based representation datasets. In addition, the symptom datasets of the previous study by Al Hakim et al. [21], as improved by Suprayitno et al. [30], were adopted for the MYCIN rule-based calculation. To calculate the certainty factor (CF) value, the rule-based method combined information from the user (the diagnosed patient, represented as PremiseCF or CFuser) and from the expert (the medical doctor, represented as RuleCF or CFrule). Each symptom was assigned its own code. The dataset for this study was created by identifying each symptom by its code and assigning it a decimal CF between 0 and 1.

Modeling and Performance Evaluation
All gathered symptom CF values are labeled as COVID-like illness information. For the COVID-like illness label, the parameters CFrule and CFuser, used as predictor variables, also generate the confidence level percentage (CL), which is used as the response variable. In the previous work by Al Hakim et al. [21] and Suprayitno et al. [30], this dataset was used extensively to derive the diagnostic inference through the inference engine employed, forward chaining.
Al Hakim et al. [21] and Suprayitno et al. [30] defined four COVID-like categories: probable, suspect, positive, and negative. These studies also provided the confidence level (CL, in %) of each category. Every uncertain case, including forecasts for COVID-like symptoms, must be assessed regularly. Because those studies ([21], [30]) unfortunately did not address this topic, we used their dataset to predict continuous values from a set of input attributes and to generate predictions on new data. The statistical results of both multiple linear regression (MLR) and multiple polynomial regression (MPR) are examined in this study.

Multiple Linear Regression Result
According to the MLR analysis, the F-value is 188.133 and the p-value is reported as 0.00 (< 0.05, one-tailed), so the model is significant (see Table 1 for the statistical report). Equation (4) displays the regression model obtained in this analysis. The constant of -0.2125 is the value of the CL variable (Y1) predicted when both predictors, CFrule and CFuser, are zero. The regression coefficient of CFrule is 0.7116, which means that the CL variable (Y1) rises by 0.7116 for each unit increase in CFrule while the other predictor, CFuser, remains constant. Likewise, the regression coefficient of CFuser, 0.3659, indicates that the CL variable (Y1) rises by 0.3659 for each unit increase in CFuser while CFrule remains constant. The adjusted R-squared, which corrects the weakness of the R-squared (coefficient of determination), indicates how much of the variability of the response variable (Y1, the CL in the MLR analysis) is explained by the predictors (CFrule and CFuser). The adjusted R-squared of 0.942 shows that CFrule and CFuser together account for 94.2% of the variation in the CL value, which is generally regarded as a strong effect size [86]. The remaining 5.8% is attributable to factors that were not investigated in this study; we assume that they arise from including each COVID-like symptom in the final CF calculation (CFcombine), which implies that the certainty factor method must be used to combine the CFuser and CFrule into the CFcombine, or final CF value [29]. The linear regression model in this study is related to that of Al-Hakim and Andriand [87], who studied leptospirosis cases using a single linear regression model. Al-Hakim et al. [83] also reported findings comparable to those of this study, namely that the data were effectively described by a multiple linear regression (MLR) model, allowing the model to be used to predict illness in the future.
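The reported MLR coefficients can be applied to a new case as in the following minimal sketch. It assumes that Equation (4) has the usual additive form, Y1 = -0.2125 + 0.7116 CFrule + 0.3659 CFuser, with the coefficients taken from the values reported above; the function name and the example inputs are hypothetical, and the scale of the predicted CL follows whatever scale was used in the original dataset.

```python
# Coefficients as reported for the MLR model; the additive form is an assumption.
INTERCEPT = -0.2125
COEF_CF_RULE = 0.7116
COEF_CF_USER = 0.3659


def predict_cl(cf_rule: float, cf_user: float) -> float:
    """Predicted confidence level (Y1) for one case under the assumed Equation (4)."""
    for value in (cf_rule, cf_user):
        if not 0.0 <= value <= 1.0:
            raise ValueError("CF inputs must lie in [0, 1]")
    return INTERCEPT + COEF_CF_RULE * cf_rule + COEF_CF_USER * cf_user


# Hypothetical new patient: expert CF of 0.8 and user CF of 0.6.
example_cl = predict_cl(0.8, 0.6)
print(f"Predicted CL = {example_cl:.5f}")
```

Note that the negative intercept makes the prediction negative when both CF inputs are close to zero, so predictions near that corner of the input range should be interpreted with caution.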

Multiple Polynomial Regression Result
The F-value for the MPR analysis is 11,239.334, and its significance is shown by the p-value, reported as 0.00 (< 0.05, one-tailed) (see Table 2 for the statistical report). Equation (5) displays the regression model obtained in this analysis. All regression coefficients are significant at a one-tailed p-value below 0.05. The coefficient of determination (R-squared) is 100%, and the adjusted R-squared is 99.9%. According to Moore et al. [86], an R-squared above 0.7 (70%) indicates a substantial effect size. Because it outperforms the MLR model on these measures, the proposed MPR regression model is the best in this investigation.
The MPR regression model (Equation 5) shows that the predictors (CFrule and CFuser) do not interact with each other, which means that the patient's reported symptoms (the source of CFuser) and the symptoms confirmed by the doctor (the source of CFrule) are not related. This is important because each patient has a different physiological condition, so the symptoms experienced by a patient do not always lead to certainty in the doctor's anamnesis results. When this model (Equation 5) is applied to new patients with new symptoms, it is expected to better predict possible COVID-like illness in the future.

Conclusion and Recommendation
Based on the analysis of the two regression models, both MLR (multi-linear or multiple linear regression) and MPR (multi-polynomial or multiple polynomial regression) can be applied as prediction models (both p-values are lower than 0.05). However, the MPR regression model provides better prediction capability for future cases of new patients, especially those with symptoms similar to COVID-19. This work is a foundation for developing expert systems that focus more on machine learning (ML) analysis methods than on traditional rule-based expert systems.
However, our analysis does not guarantee the accuracy of the predictions of the existing expert system. It only uses statistical regression approaches to test which regression model is best for future predictions. Learning-assisted analysis is an alternative for expert system research, and we hope it will form the basis for further research on improving the performance of expert systems.