Quick Overview
Accepted at the 19th International Conference on Artificial Intelligence in Medicine (AIME 2021), and published in Springer Lecture Notes in Artificial Intelligence.
In this work: (1) we developed an algorithm to extract qualitative rules from a causal/probabilistic model, (2) we applied our algorithm to a data set for predicting gestational diabetes from clinical observations, and (3) we verified the rules with the prior knowledge of a clinical expert in gynecology (Dr. Haas)—finding a precision of 0.923 and suggesting some mismatched cases could be future research directions.
Qualitative Knowledge in Machine Learning
This work combines two project categories our group investigates: “HumanAllied Artificial Intelligence” and “Precision Health.”
Consider the following statements:
 Risk of lung cancer increases with the number of cigarettes smoked
 Risk of high blood glucose decreases with each additional hour of sleep, up to eight
 Risk of gestational diabetes (GDM) increases with BMI
Each statement describes how a change in one variable influences the change in another variable. Such statements are a common result of medical studies, and may be highly influential in medical decision making. These are statements about monotonic influence (MI) between variables, denoted by $X_{\prec}^{M+}Y$. Previous work asked domain experts to provide these statements up front, and empirical results show that this inductive bias makes it possible to learn good models even in settings with little data.^{1}
Now consider these statements:
 Increase in BMI increases the risk of high blood pressure in patients with family history of hypertension more than in patients without such family history.
 Increase in blood sugar level increases the risk of heart attack in high cholesterol level patients more than it does in patients with low cholesterol levels.^{2}
These are statements about synergistic influnece (SI) between sets of variables ${A,B}_{\prec}^{S+}Y$.
Both types of knowledge have been successfully applied as inductive bias prior to learning. Here we are interested in a reverse problem: given a causal/probabilistic world model, can we reverseengineer the same kind of qualitative knowledge that experts have? This has potential to be an important component for getting machines to reason and explain themselves in ways that are amenable to how humans understand the world.
Standard rule mining doesn’t handle this case
Decision Rules are close in spirit.^{3} Assuming data is ordinal, categorical, or binned from continuous attributes; a rule such as the following could be observed:
$\text{if}~(X_{0} = 0 \land X_{1} = 0);~\text{then}$
$\quad Y = 0$
Substituting humanreadable names for the variables:
$\text{if}~(\text{Systolic Blood Pressure Category} = 0 \land \text{Diastolic Blood Pressure Category} = 0);~\text{then}$
$\quad \text{Blood Pressure Category} = 0$
Then converting the rule to natural language:
$\text{If a person’s systolic blood pressure is less than 120 and their}$ $\text{diastolic blood pressure is less than 80, then they have normal blood pressure.}$
But how should one write a statement from earlier: “Risk of gestational diabetes (GDM) increases with BMI” as a conjunctive rule? It is not obvious. Expressing this statement requires expressing all possible values of the body mass index (BMI) variable, expressing the possible values of gestational diabetes, weighing the rules by the probability of each outcome, and finally asserting that a monotonic increase in the former implies a monotonic increase in the latter.^{4}
The qualitative influence statement:
\[\text{BMI}_{\prec}^{M+}\text{GDM}\]is a much more general and concerns all possible values of both variables. This single rule concisely expresses an idea that it would take multiple decision rules.
Learning Qualitative Influence Statements: The QuaKE Algorithm
We develop metrics to measure the degree of monotonicity and synergy, denoted by $\delta_{a}$ for degree of monotonic influence and $\delta_{a,b}$ for degree of synergistic influence. Both have slack $\epsilon$ hyperparameters to allow some violation of the constraints, and threshold $T$ hyperparameters to tune the degree that should be considered a true QI statement.
Input:
$\quad$ joint probability model $P(Y, \boldsymbol{X})$,
$\quad$ label vector $Y$, ordinal feature vector $\boldsymbol{X}$,
$\quad$ monotonic and synergistic slack parameters: $\epsilon_{m}$, $\epsilon_{s}$
$\quad$ monotonic and synergistic threshold parameters $T_{m}$, $T_{s}$.
Initialize: $\boldsymbol{R} \leftarrow \emptyset$for $a \leftarrow 0$ to $(\boldsymbol{X}  1)$ do
$\quad$ compute $\delta_{a}$ using (Degree of monotonic influence calculation)
$\quad$ if $\delta_{a} \geq T_{m}$ then
$\qquad$ $\boldsymbol{R} \leftarrow$ ($X_{a\prec}^{M+}Y$) $\cup$ $\boldsymbol{R}$
$\quad$ for $b \leftarrow a + 1$ to $(\boldsymbol{X}  1)$ do
$\qquad$ compute $\delta_{a,b}$ using (Degree of synergistic influence calculation)
$\qquad$ if $\delta_{a,b} \geq T_{s}$ then
$\qquad \quad$ $\boldsymbol{R}$ $\leftarrow$ (${X_a,X_b}_{\prec}^{S+}Y$) $\cup$ $\boldsymbol{R}$
return $\boldsymbol{R}$
Degree of monotonic influence $\delta_{a}$ of a variable $X_{a} \in \boldsymbol{X}$ on $Y$ is defined as:
\[\delta_{a} = I_{(C_a>0)} \cdot \sum_j\sum_{j'>j}\sum_k \frac{P(Y \leq kX_a=x_a^j)  P(Y \leq kX_a=x_a^{j'})}{X_a}\]where,
\[C_{a} = \prod_{j}\prod_{j'>j}\prod_{k}{\max(P(Y \leq kX_a=x_a^j)  P(Y \leq kX_a=x_a^{j'}) + \epsilon_m, 0)}\]Degree of synergistic influence is defined in a similar way:
\[\delta_{a,b} = I_{(C_{a,b}>0)} \cdot \sum_i\sum_{i'>i}\sum_j\sum_{j'>j} \frac{\phi_{a,b}^{i,i',j,j'}}{X_a \cdot X_b}\]where,
\[C_{a,b} = \prod_i\prod_{i'>i}\prod_j\prod_{j'>j} \max(\phi_{a,b}^{i,i',j,j'} + \epsilon_s,0)\]and
\[\begin{split} \phi_{a,b}^{i,i',j,j'} = \sum_k P(Y \leq kX_a=x_a^i,X_b=x_b^j) & P(Y \leq kX_a=x_a^{i'},X_b=x_b^j)~ \\ P(Y \leq kX_a=x_a^{i}, X_b=x_b^{j'}) &+ P(Y \leq kX_a=x_a^{i'}, X_b=x_b^{j'}) \end{split}\]The reasoning for each of these are explained further in Section 2 of the paper, but briefly: the degree of monotonic influence measures the difference in probability of $Y$ for a variable $X_{a}$ across combinations of its values $x_{a}^{j}$ and $x_{a}^{j^{\prime}}$. The synergistic calculations are similar, but take into account the context specific influences of one variable in the presence of another.
Application to Gestational Diabetes
We applied this to a domain where the goal was to predict gestational diabetes from clinical observations.
We extracted rules using our algorithm, and asked David M. Haas to do a similar labeling over the rules. QuaKE scored a precision of 0.923 (over five cross validation folds) compared to Dr. Haas. We interpret this result as showing good agreement with clinical knowledge in most cases, but also speculated that cases where they disagree could be interesting directions for future study.
Citation
If you build on or use portions of this work, please provide credit using the following reference or BibTeX.
Athresh Karanam, Alexander L. Hayes, Harsha Kokel, David M. Haas, Predrag Radivojac, and Sriraam Natarajan. (2021) A Probabilistic Approach to Extract Qualitative Knowledge for Early Prediction of Gestational Diabetes. In: Artificial Intelligence in Medicine. AIME 2021.
@inproceedings{karanam2021quake,
author = {Athresh Karanam and Alexander L. Hayes and Harsha Kokel and David M. Haas and Predrag Radivojac and Sriraam Natarajan},
title = {A Probabilistic Approach to Extract Qualitative Knowledge for Early Prediction of Gestational Diabetes},
year = {2021},
booktitle = {Artificial Intelligence in Medicine}
}
Acknowledgements
We gratefully acknowledge the support of 1R01HD101246 from NICHD and Precision Health Initiative of Indiana University. Thanks to Rashika Ramola and Rafael Guerrero for advice during the data processing phase and their helpful discussions and feedback.
Footnotes

Altendorf et al. presented results of monotonic constraints that included the UCI Breast Cancer Wisconsin data set as an extreme case of this, scoring 90% accuracy with a single example and qualitative background knowledge in the form of monotonicities—this showed that an informed learner with a single example could be equivalent to an uninformed learner given hundreds of examples. For full details, see learning curves in Figure 8 of “Learning from Sparse Data by Exploiting Monotonicity Constraints.” ↩

From Yang and Natarajan (2013) “Knowledge Intensive Learning: Combining Qualitative Constraints with Causal Independence for Parameter Learning in Probabilistic Models” ↩

The discussion presented here is overly concise and ignores differences in model learning. Decision rules and decisions trees often strike a good balance between effectiveness and interpretability. For a longer discussion on learning and interpreting decision rules, Christoph Molnar’s summary can be a good place to look: Interpretable Machine Learning > Interpretable Models > Decision Rules. ↩

We speculated that this could still be a viable option. Gopalakrishnan [2010] studied a related problem and proposed a “Bayesian Rule Extraction” algorithm. This learned the structure and parameters of a Bayesian Network (BN) then extracted rules by comparing confidence factors between binary outcomes conditioned on the same evidence. It may be possible to extract qualitative rules by running the Gopalakrishnan algorithm on learned BNs and checking whether the confidence factors increase (decrease) as the factors increase (decrease). For full details, refer to their paper on “Bayesian rule learning for biomedical data mining.” Alexander L. Hayes implemented a version of this, an overview and example of applying the method is included in the README of this GitHub repository. ↩