In this tutorial we will learn some of the basics of Bayesian Networks by solving a simulated problem. Our agenda will cover the tidyverse and bnlearn.

set.seed(1020)

You should know the tidyverse: it makes R easier and more fun. You can get more information at https://www.tidyverse.org. I really suggest you take Hadley Wickham's Data Science course http://r4ds.had.co.nz

In this tutorial I will be using the following tidyverse utilities:
- dplyr in general
- ggplot2 for almost any plot
- tidyr, specifically the gather function
- lubridate to extract components from dates
- readr to read data from csv files

In order to make the tutorial work you should install and load the following packages:

library(tidyverse)
library(lubridate)
The bnlearn package is the most complete and popular package for Bayesian Networks available to date in R. We will start our tutorial by reviewing some of its capabilities.
First we should install and load the package:
library(bnlearn)
*Please make sure that you either download or compile at least bnlearn 4.2.x, as previous versions might have some bugs.
You should check out the documentation that is available online [Link]. It contains nice examples and additional use cases beyond this tutorial. If you are interested in this topic, there are a couple of related books written by the author of the package; you can find the references on the mentioned website.
The usage of bnlearn revolves around two main data structures that represent a Bayesian Network in different stages: bn, which encodes the structure (the DAG), and bn.fit, which also stores the parameters (NOTE that these are S3 classes and the names might overlap with other functions).
There are different ways to manually initialize and modify the graph of a Bayesian Network.
We can create empty graphs from a set of variables:
vars <- LETTERS[1:6]
dag <- empty.graph(vars)
dag
Random/Generated Bayesian network
model:
[A][B][C][D][E][F]
nodes: 6
arcs: 0
undirected arcs: 0
directed arcs: 0
average markov blanket size: 0.00
average neighbourhood size: 0.00
average branching factor: 0.00
generation algorithm: Empty
You can specify arc sets in various ways. The most common are:
As a two-column (from, to) matrix, setting them via arcs on an existing network:
e <- matrix(
c("A", "C", "B", "F", "C", "F"),
ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("from", "to"))
)
arcs(dag) <- e
dag
Random/Generated Bayesian network
model:
[A][B][D][E][C|A][F|B:C]
nodes: 6
arcs: 3
undirected arcs: 0
directed arcs: 3
average markov blanket size: 1.33
average neighbourhood size: 1.00
average branching factor: 0.50
generation algorithm: Empty
We can also use an adjacency matrix, and assign it to a dag with amat:
adj <- matrix(
0L,
ncol = 6,
nrow = 6,
dimnames = list(vars, vars)
)
adj["A", "C"] = 1L
adj["B", "F"] = 1L
adj["C", "F"] = 1L
adj["D", "E"] = 1L
adj["A", "E"] = 1L
print(adj)
A B C D E F
A 0 0 1 0 1 0
B 0 0 0 0 0 1
C 0 0 0 0 0 1
D 0 0 0 0 1 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
amat(dag) <- adj
dag
Random/Generated Bayesian network
model:
[A][B][D][C|A][E|A:D][F|B:C]
nodes: 6
arcs: 5
undirected arcs: 0
directed arcs: 5
average markov blanket size: 2.33
average neighbourhood size: 1.67
average branching factor: 0.83
generation algorithm: Empty
The last option is to create a formula for a given set of variables. Each node is specified by a pair of brackets [<var_name>]. If the node has a parent set, we denote it by | and specify the list of parents separated by colons :. We can compute the formula into a bn object with model2network.
dag <- model2network("[A][C][B|A][D|C][F|A:B:C][E|F]")
dag
Random/Generated Bayesian network
model:
[A][C][B|A][D|C][F|A:B:C][E|F]
nodes: 6
arcs: 6
undirected arcs: 0
directed arcs: 6
average markov blanket size: 2.67
average neighbourhood size: 2.00
average branching factor: 1.00
generation algorithm: Empty
The package provides other capabilities such as random graph generation and graph sampling. In addition there is a range of utilities to manipulate the graphs [Link] [Link].
We can plot graphs with the built-in R engine by calling plot on a bn object:
plot(dag)
Minimal aspects of the plot can be customized as documented in the corresponding help page. Other packages can be used indirectly to plot graphs; bnlearn provides connections with some of them, but be aware that some might be outdated.
Fortunately, graphs are a common data structure and we can find lots of utilities to work with them. One of the most common visualization tools for graphs, and for data in general, is the D3 library from the Javascript domain. Thanks to the integration of R with web technologies we can find wonderful ports such as the networkD3 package.
The next snippet is just a custom function to transform a bn object into the information required to plot a D3 force graph.
Please install networkD3, and be careful: this package is currently under beta development and future versions could break this code.
plotD3bn <- function(bn) {
varNames <- nodes(bn)
# Nodes should be zero indexed!
links <- data.frame(arcs(bn)) %>%
mutate(from = match(from, varNames)-1, to = match(to, varNames)-1, value = 1)
nodes <- data.frame(name = varNames) %>%
mutate(group = 1, size = 30)
networkD3::forceNetwork(
Links = links,
Nodes = nodes,
Source = "from",
Target = "to",
Value = "value",
NodeID = "name",
Group = "group",
fontSize = 20,
zoom = TRUE,
arrows = TRUE,
bounded = TRUE,
opacityNoHover = 1
)
}
We can now plot our last generated BN. If the result looks too small you can zoom in and out using your mouse wheel.
plotD3bn(dag)
There are different file formats to represent a Bayesian network. They have originated over the years as efforts to create standards or as parts of particular proprietary systems. Despite these efforts there is no agreed-upon closed standard; perhaps the closest one is the xbif format [Link].
bnlearn provides several ways to load BNs from different formats [Docs]. You should use them when importing from and exporting to other systems, or to share networks with other people; if you plan to use just bnlearn you can simply save an rda file.
The maintainers of bnlearn also provide a modern R-focused repository for a series of popular Bayesian networks that have been used extensively for benchmarking in the literature.
http://www.bnlearn.com/bnrepository/
Here you can find networks with different properties that can be used to test algorithms or explore this and other BN packages. We will now work with the popular asia network: Asia is for BNs what "iris" is for statistics.
# This downloads the RData file from the repository and loads it.
# The bn is loaded into a bn.fit variable called bn
load(url("http://www.bnlearn.com/bnrepository/asia/asia.rda"))
asia <- bn
We will now explore the contents of this variable. It contains not only the structure but the parameters as well; first, though, we will take a look at the structure:
bn.net(asia)
Random/Generated Bayesian network
model:
[asia][smoke][tub|asia][lung|smoke][bronc|smoke][either|tub:lung][xray|either]
[dysp|bronc:either]
nodes: 8
arcs: 8
undirected arcs: 0
directed arcs: 8
average markov blanket size: 2.50
average neighbourhood size: 2.00
average branching factor: 1.00
generation algorithm: Empty
plotD3bn(asia)
Now it is time to review the parameters; this prints each node and the associated probability table. In this case all variables are discrete, so the tables are conditional probability tables.
asia
Bayesian network parameters
Parameters of node asia (multinomial distribution)
Conditional probability table:
yes no
0.01 0.99
Parameters of node tub (multinomial distribution)
Conditional probability table:
asia
tub yes no
yes 0.05 0.01
no 0.95 0.99
Parameters of node smoke (multinomial distribution)
Conditional probability table:
yes no
0.5 0.5
Parameters of node lung (multinomial distribution)
Conditional probability table:
smoke
lung yes no
yes 0.10 0.01
no 0.90 0.99
Parameters of node bronc (multinomial distribution)
Conditional probability table:
smoke
bronc yes no
yes 0.6 0.3
no 0.4 0.7
Parameters of node either (multinomial distribution)
Conditional probability table:
, , tub = yes
lung
either yes no
yes 1 1
no 0 0
, , tub = no
lung
either yes no
yes 1 0
no 0 1
Parameters of node xray (multinomial distribution)
Conditional probability table:
either
xray yes no
yes 0.98 0.05
no 0.02 0.95
Parameters of node dysp (multinomial distribution)
Conditional probability table:
, , either = yes
bronc
dysp yes no
yes 0.9 0.7
no 0.1 0.3
, , either = no
bronc
dysp yes no
yes 0.8 0.1
no 0.2 0.9
We can access individual nodes of the net as in a data.frame:
asia$smoke
Parameters of node smoke (multinomial distribution)
Conditional probability table:
yes no
0.5 0.5
There is also a function to plot the distributions of discrete networks:
bn.fit.barchart(asia$smoke)
bn.fit.barchart(asia$dysp)
We can alter the probability tables of a BN. This is useful to override parameters learnt from data or to set unobserved variables, and it allows us to include expert information from the domain of the modelled problem.
To modify a conditional probability table you can directly replace the existing table in the model after extracting it with coef. Be careful to maintain the inherent restrictions of the probability distribution.
cpt <- coef(asia$smoke)
cpt[] <- c(0.2, 0.8)
asia$smoke <- cpt
asia$smoke
Parameters of node smoke (multinomial distribution)
Conditional probability table:
yes no
0.2 0.8
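When editing larger tables by hand it is easy to violate the restriction that each conditional distribution must sum to one. A minimal base-R safeguard (shown here on a standalone example table, not the asia network) is to renormalize with prop.table:

```r
# A 2x2 conditional probability table: each column is one parent state
# and must sum to 1
cpt <- matrix(c(0.7, 0.3, 0.2, 0.8), nrow = 2,
              dimnames = list(child = c("yes", "no"), parent = c("yes", "no")))
cpt["yes", "yes"] <- 0.9            # edit one entry, breaking the constraint...
cpt <- prop.table(cpt, margin = 2)  # ...then renormalize each column
colSums(cpt)                        # both columns sum to 1 again
```

The same renormalization works for higher-dimensional tables by choosing the margins that index the parent configurations.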
bnlearn introduces an R-like sampling function, rbn, to draw data from a given fitted model. We will now sample a dataset from asia to test learning from data.
# Note that the order of the parameters is inverted from the R sample functions
sampleAsia <- rbn(x = asia, n = 10000)
head(sampleAsia)
We can induce the parameters of a Bayesian Network from observed data. bnlearn provides different algorithms for this; Maximum Likelihood Estimation (MLE) is the most common one.
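For discrete nodes, MLE reduces to counting and normalizing. A base-R sketch with simulated (hypothetical) binary variables illustrates the idea behind what bn.fit computes for each node:

```r
set.seed(1)
# Simulated parent/child sample: P(child = "yes" | parent) is 0.8 or 0.3
parent <- sample(c("yes", "no"), 1000, replace = TRUE)
child  <- ifelse(runif(1000) < ifelse(parent == "yes", 0.8, 0.3), "yes", "no")
# MLE of P(child | parent): normalize the contingency table over each parent state
cpt <- prop.table(table(child, parent), margin = 2)
cpt  # columns sum to 1; estimates close to 0.8 and 0.3
```

This is only a sketch of the discrete case; bn.fit performs the equivalent estimation for every node given its parents in the DAG.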
We can invoke the learning algorithm with the function bn.fit. For that we need a DAG and a compatible dataset:
net <- bn.net(asia)
asiaInduced <- bn.fit(x = net, data = sampleAsia)
We can now compare the two networks; there should be some discrepancies in the induced one. Notice that extremely marginal probabilities will rarely be simulated and thus will not have a significant presence in the sample.
asiaInduced
Bayesian network parameters
Parameters of node asia (multinomial distribution)
Conditional probability table:
yes no
0.0103 0.9897
Parameters of node tub (multinomial distribution)
Conditional probability table:
asia
tub yes no
yes 0.04854369 0.01040719
no 0.95145631 0.98959281
Parameters of node smoke (multinomial distribution)
Conditional probability table:
yes no
0.2008 0.7992
Parameters of node lung (multinomial distribution)
Conditional probability table:
smoke
lung yes no
yes 0.09711155 0.01051051
no 0.90288845 0.98948949
Parameters of node bronc (multinomial distribution)
Conditional probability table:
smoke
bronc yes no
yes 0.6000996 0.2936687
no 0.3999004 0.7063313
Parameters of node either (multinomial distribution)
Conditional probability table:
, , lung = yes
tub
either yes no
yes 1 1
no 0 0
, , lung = no
tub
either yes no
yes 1 0
no 0 1
Parameters of node xray (multinomial distribution)
Conditional probability table:
either
xray yes no
yes 0.96850394 0.05291610
no 0.03149606 0.94708390
Parameters of node dysp (multinomial distribution)
Conditional probability table:
, , either = yes
bronc
dysp yes no
yes 0.91847826 0.68020305
no 0.08152174 0.31979695
, , either = no
bronc
dysp yes no
yes 0.80938242 0.09854423
no 0.19061758 0.90145577
On many occasions the structure of the model is designed by hand. We usually know the variables of the problem and the different causal patterns from expert knowledge, and this descriptive information can be transferred to the model. This is what we call an open box model, and it provides a powerful framework for many problems, especially when compared to other models that do not provide a clear interpretation of their parameters.
However, there are many situations in which we would like to automate the structural learning of a model, such as causal pattern discovery or a lack of knowledge of the domain.
In other cases we just want to learn about particular dependency relationships between the variables, or select the best structure among a particular set.
bnlearn specializes in structural learning, and for this reason it provides a range of algorithms to learn the models. There is a complex taxonomy of such algorithms related to the statistical tests, metrics and heuristics on which they are based. Exact learning is an NP-hard problem and thus several approximate approaches have been proposed.
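To get an intuition for why exact structure learning is hard, note that the number of labelled DAGs grows super-exponentially with the number of nodes. A small base-R implementation of Robinson's recurrence for counting DAGs shows the blow-up:

```r
# Robinson's recurrence: a(n) = sum_k (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), a(0) = 1
n.dags <- function(n) {
  a <- 1  # a[1] holds the count for 0 nodes
  for (m in seq_len(n)) {
    k <- seq_len(m)
    a <- c(a, sum((-1)^(k + 1) * choose(m, k) * 2^(k * (m - k)) * a[m - k + 1]))
  }
  a[n + 1]
}
sapply(1:6, n.dags)  # 1, 3, 25, 543, 29281, 3781503
```

Already at 8 nodes (the size of the asia network) there are over 7.8e11 candidate structures, which is why heuristic search is the norm.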
We will briefly study the hc algorithm to learn a full structure, and the BIC score metric to measure the fit of a particular network with a dataset.
The hc algorithm can simply be run from a sample of data, although many options can be set to optimize it for a particular problem.
networkInduced <- hc(x = sampleAsia)
networkInduced
Bayesian network learned via Score-based methods
model:
[asia][tub][smoke][lung|smoke][bronc|smoke][either|tub:lung][xray|either][dysp|bronc:either]
nodes: 8
arcs: 7
undirected arcs: 0
directed arcs: 7
average markov blanket size: 2.25
average neighbourhood size: 1.75
average branching factor: 0.88
learning algorithm: Hill-Climbing
score: BIC (disc.)
penalization coefficient: 4.60517
tests used in the learning procedure: 77
optimized: TRUE
Let's compare it with the original golden model, as the algorithm may have introduced some differences given that we used a small data sample.
modelstring(asia)
[1] "[asia][smoke][tub|asia][lung|smoke][bronc|smoke][either|lung:tub][xray|either][dysp|bronc:either]"
We can also compute some metrics to compare the network. The structural Hamming distance determines the amount of discrepancy between the two graphs.
shd(bn.net(asia), networkInduced)
[1] 1
The BIC metric measures the fit of the structure for a given sample; it also penalizes the number of parameters to avoid overfitting. Note that bnlearn rescales BIC so that higher (less negative) values are better. In this case the results are almost the same, and it seems that the induced model could be slightly biased towards the sample, as it marginally outperforms the golden model.
print(BIC(object = asia, data = sampleAsia))
[1] -19437.54
print(BIC(object = networkInduced, data = sampleAsia))
[1] -19432.09
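bnlearn's BIC score is the log-likelihood minus a penalty of (log n)/2 per free parameter, so larger values are better (the penalization coefficient 4.60517 reported above is exactly log(10000)/2; note that stats::BIC for regression models uses the opposite sign convention). A base-R sketch of the computation for a single binary node, with hypothetical data:

```r
set.seed(42)
x <- sample(c("yes", "no"), 1000, replace = TRUE, prob = c(0.2, 0.8))
p <- prop.table(table(x))           # MLE of the node's distribution
loglik <- sum(table(x) * log(p))    # log-likelihood of the sample under the MLE
k <- 1                              # a binary node has one free parameter
bic <- loglik - (log(length(x)) / 2) * k
bic  # always below the log-likelihood: the penalty discourages extra parameters
```

For a full network the same computation is summed over all nodes, with k counting each node's free parameters given its parents.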
Gaussian networks differ in the class of probability tables that represent them. If all nodes are Gaussian we will find Gaussian nodes and conditional Gaussian nodes. Gaussian nodes are encoded by the normal distribution parameters (mean and sd); conditional Gaussian nodes are represented as a linear regression with a coefficient for each parent, an intercept term and the standard deviation of the residuals.
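A conditional Gaussian node is, in effect, a linear regression of the child on its continuous parents. A quick base-R check with two simulated (hypothetical) nodes A and C makes this concrete:

```r
set.seed(7)
A <- rnorm(2000, mean = 1, sd = 1)       # parent node A ~ N(1, 1)
C <- 2 + 3 * A + rnorm(2000, sd = 0.5)   # child C | A: intercept 2, coef 3, sd 0.5
fit <- lm(C ~ A)                         # recovers the node's parameters
coef(fit)           # close to the generating intercept (2) and coefficient (3)
sd(residuals(fit))  # close to the generating residual sd (0.5)
```

Fitting a Gaussian BN amounts to running one such regression per node against its parents.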
bnlearn has some sample Gaussian data to test these BNs:
data(gaussian.test)
dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
model <- bn.fit(dag, gaussian.test)
model
Bayesian network parameters
Parameters of node A (Gaussian distribution)
Conditional density: A
Coefficients:
(Intercept)
1.007493
Standard deviation of the residuals: 1.004233
Parameters of node B (Gaussian distribution)
Conditional density: B
Coefficients:
(Intercept)
2.039499
Standard deviation of the residuals: 3.034111
Parameters of node C (Gaussian distribution)
Conditional density: C | A + B
Coefficients:
(Intercept) A B
2.001083 1.995901 1.999108
Standard deviation of the residuals: 0.5089772
Parameters of node D (Gaussian distribution)
Conditional density: D | B
Coefficients:
(Intercept) B
5.995036 1.498395
Standard deviation of the residuals: 0.3286672
Parameters of node E (Gaussian distribution)
Conditional density: E
Coefficients:
(Intercept)
3.493906
Standard deviation of the residuals: 1.98986
Parameters of node F (Gaussian distribution)
Conditional density: F | A + D + E + G
Coefficients:
(Intercept) A D E G
-0.006047321 1.994853041 1.005636909 1.002577002 1.494373265
Standard deviation of the residuals: 0.9961475
Parameters of node G (Gaussian distribution)
Conditional density: G
Coefficients:
(Intercept)
5.028076
Standard deviation of the residuals: 1.983631
To modify a Gaussian network node we proceed as in the discrete case:
model$A <- list(coef = c("(Intercept)" = 10), sd = 0)
model$A
Parameters of node A (Gaussian distribution)
Conditional density: A
Coefficients:
(Intercept)
10
Standard deviation of the residuals: 0
A hybrid network contains both discrete and continuous variables. There is a framework restriction: a discrete variable cannot have any continuous parent. The parameters are modelled as in the previous cases.
A continuous variable with discrete parents is represented by a conditional Gaussian distribution, with a linear Gaussian distribution (over any continuous parents) for each configuration of the discrete parents.
In the next example we will use custom.fit to manually load the parameters into a graph:
net <- model2network("[A][B][C|A:B]")
cptA <- matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("LOW", "HIGH")))
distB <- list(coef = c("(Intercept)" = 1), sd = 1.5)
distC <- list(
coef =
matrix(
c(1.2, 2.3, 3.4, 4.5),
ncol = 2,
dimnames = list(c("(Intercept)", "B"), NULL)
),
sd = c(0.3, 0.6)
)
model = custom.fit(net, dist = list(A = cptA, B = distB, C = distC))
model
Bayesian network parameters
Parameters of node A (multinomial distribution)
Conditional probability table:
LOW HIGH
0.4 0.6
Parameters of node B (Gaussian distribution)
Conditional density: B
Coefficients:
(Intercept)
1
Standard deviation of the residuals: 1.5
Parameters of node C (conditional Gaussian distribution)
Conditional density: C | A + B
Coefficients:
0 1
(Intercept) 1.2 3.4
B 2.3 4.5
Standard deviation of the residuals:
0 1
0.3 0.6
Discrete parents' configurations:
A
0 LOW
1 HIGH
The inference engine of bnlearn is limited, but it can be used to test the networks and to perform basic operations with the models.
Using cpquery we can ask for the probability of an event given a set of evidence; both are boolean expressions involving the variables in the model. We may ask for a particular combination of configurations in the BN and a set of observed states for the variables in the evidence.
For example, in the asia network, we could ask: what is the probability of a positive lung cancer diagnosis for a person that smokes?
(For cpquery I recommend the more powerful lw algorithm, which performs likelihood weighting.)
# First we should observe the prior probability to compare
# TRUE is for empty evidence
cpquery(asia, event = lung == "yes", evidence = TRUE)
[1] 0.0266
cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000)
[1] 0.10054
Notice that the method is not very stable, so it is better to average over a repeated sample of predictions:
query <- replicate(1000, cpquery(asia, event = lung == "yes", evidence = TRUE))
print(mean(query))
[1] 0.0216
query <- replicate(1000, cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000))
print(mean(query))
[1] 0.0997
The other option is to use cpdist to sample cases from the network for a set of nodes in the presence of some evidence. The usage is the same, and we can obtain more stability by increasing the size of the sample.
s <- cpdist(asia, nodes = c("lung"), evidence = TRUE, n=1000)
head(s)
We can even summarise or plot the distribution:
summary(s)
lung
yes: 21
no :979
prop.table(table(s))
s
yes no
0.021 0.979
ggplot(s, aes(x=lung)) + geom_bar()
Now that we have a nice overview of how bnlearn operates, we will switch to an applied problem in which we want to build a model for a real context.
We are part of a Data Science team at an IoT company that has developed a new temperature sensor to monitor data center rooms, in order to detect and predict refrigeration failures early. The sensors were installed a year ago, and we need to create a model that is able to detect specific heat outbreaks or global heating problems in the room.
We are provided with a dataset consisting of a number of measures per day over the previous year, for a total of 3 sensors that have been placed at strategic locations across the room.
We will start working on the first version of the problem, heatAlarm-lvl1.csv.
dataRaw <- read_csv("./data/heatAlarm-lvl1.csv")
Parsed with column specification:
cols(
Date = col_date(format = ""),
Hour = col_character(),
TS1 = col_double(),
TS2 = col_double(),
TS3 = col_double()
)
head(dataRaw)
We will start by studying the properties of our problem at hand. First we should look into the distribution of the dataset.
summary(dataRaw)
Date Hour TS1 TS2 TS3
Min. :2016-01-01 Length:1830 Min. : 4.627 Min. :12.44 Min. : 8.89
1st Qu.:2016-04-01 Class :character 1st Qu.:12.745 1st Qu.:21.41 1st Qu.:17.61
Median :2016-07-01 Mode :character Median :16.979 Median :26.09 Median :22.15
Mean :2016-07-01 Mean :17.011 Mean :26.12 Mean :22.19
3rd Qu.:2016-10-01 3rd Qu.:21.384 3rd Qu.:30.96 3rd Qu.:26.88
Max. :2016-12-31 Max. :31.276 Max. :41.89 Max. :37.50
We see that we have a complete year of measures and different ranges for each sensor. Now it is interesting to check the completeness of the data; we would also like to see its distribution along dates.
dataRaw %>%
group_by(Date) %>%
summarise(Measures = n()) %>%
ggplot(aes(x=Measures)) + geom_histogram()
We confirm that the number of measures per day is stable. We will now look at the distribution of the temperature measures for each sensor.
dataRaw %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
geom_line()
dataRaw %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
geom_smooth()
dataRaw %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
geom_boxplot()
We observe three different distributions, one per sensor. Visually there appears to be a linear relationship between them; in fact, a simple additive relationship that could be modelled with a scalar intercept. This dependency can be modelled in a BN, but what is the best way to do it?
The solution is elegant from both the computational and the statistical point of view. We can think of the sensors as measuring some real but unobservable temperature of the environment. This hidden value conditions the exact measures that the sensors capture, and can thus model their individual distributions.
In other words, our model requires an additional variable, more concretely a latent Temp variable that cannot be observed but that is captured by each sensor through a particular dependency. We will represent this by adding a new variable to our data (we will also remove unused variables).
dataLatent <- dataRaw %>%
mutate(Temp = NA) %>%
select(Temp, TS1, TS2, TS3) %>%
as.data.frame()
dataLatent$Temp <- as.numeric(dataLatent$Temp)
The next step is to estimate this variable to complete our model. We can learn the three sensor distributions from the data, but not the Temp variable. For that we will use the EM algorithm. First, let's build the model:
heatAlarmDag1 <- model2network("[Temp][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag1)
Now for the EM. Unfortunately bnlearn does not provide a standalone implementation of this algorithm (it does for structural EM, which is a superset), but it provides the tools to do so. Basically, the parametric EM algorithm alternates an expectation step, in which the latent variables are imputed from the current fitted model, and a maximization step, in which the parameters are refitted from the completed data.
As you can imagine, these steps are repeated for a number of iterations or until convergence. To impute the latent variables we will use the function impute; however, we need an initial model to do so. This is another important element of the EM algorithm, as we need a good initial guess to ensure a proper fit.
Let's first try a random initialization of Temp.
# Lets add uniform Temp
dataImputed <- dataLatent %>% rowwise() %>% mutate(Temp = runif(1, 10, 25))
dataImputed %>%
ggplot(aes(x=Temp)) + geom_density()
The EM algorithm is then quite simple:
parametric.em <- function(dag, dataLatent, dataImputed, iter = 5) {
fitted <- bn.fit(dag, dataImputed, method = "mle")
for ( i in seq(iter)) {
complete <- impute(fitted, data = dataLatent, method = "bayes-lw")
fitted <- bn.fit(dag, complete, method = "mle")
}
fitted
}
Let’s try it
heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
Let's see the resulting model:
heatAlarmModel1
Bayesian network parameters
Parameters of node Temp (Gaussian distribution)
Conditional density: Temp
Coefficients:
(Intercept)
17.43173
Standard deviation of the residuals: 0.3912403
Parameters of node TS1 (Gaussian distribution)
Conditional density: TS1 | Temp
Coefficients:
(Intercept) Temp
247.33610 -13.21297
Standard deviation of the residuals: 0.04914744
Parameters of node TS2 (Gaussian distribution)
Conditional density: TS2 | Temp
Coefficients:
(Intercept) Temp
280.69264 -14.60381
Standard deviation of the residuals: 0.05432086
Parameters of node TS3 (Gaussian distribution)
Conditional density: TS3 | Temp
Coefficients:
(Intercept) Temp
269.48291 -14.18656
Standard deviation of the residuals: 0.05276883
# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()
Look at those intercept coefficients; they do not look right. As we can see in the plot it seems like a forced fit, where the Temp variable has a very different distribution. Let's try another guess at the prior for Temp; this time we will use a Gaussian distribution with statistics drawn from the sensors.
statistics <- dataLatent %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
summarise(
mu = mean(Measure),
sigma = sd(Measure)
)
dataImputed <- dataLatent %>%
rowwise() %>%
mutate(Temp = rnorm(1, statistics$mu, statistics$sigma))
dataImputed %>%
ggplot(aes(x=Temp)) + geom_density()
Let’s repeat the EM:
heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
heatAlarmModel1
Bayesian network parameters
Parameters of node Temp (Gaussian distribution)
Conditional density: Temp
Coefficients:
(Intercept)
21.75432
Standard deviation of the residuals: 0.1244617
Parameters of node TS1 (Gaussian distribution)
Conditional density: TS1 | Temp
Coefficients:
(Intercept) Temp
-886.55599 41.53506
Standard deviation of the residuals: 0.04063577
Parameters of node TS2 (Gaussian distribution)
Conditional density: TS2 | Temp
Coefficients:
(Intercept) Temp
-972.55651 45.90717
Standard deviation of the residuals: 0.04491322
Parameters of node TS3 (Gaussian distribution)
Conditional density: TS3 | Temp
Coefficients:
(Intercept) Temp
-947.95912 44.59554
Standard deviation of the residuals: 0.04362999
# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()
This time the fit is even worse, as we overfitted a bad prior. We will try to make a better guess, this time dependent on the actual values of the sensors.
dataImputed <- dataLatent %>%
rowwise() %>%
mutate(Temp = mean(c(TS1, TS2, TS3)))
dataImputed %>%
ggplot(aes(x=Temp)) + geom_density()
heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
heatAlarmModel1
Bayesian network parameters
Parameters of node Temp (Gaussian distribution)
Conditional density: Temp
Coefficients:
(Intercept)
21.77576
Standard deviation of the residuals: 5.46792
Parameters of node TS1 (Gaussian distribution)
Conditional density: TS1 | Temp
Coefficients:
(Intercept) Temp
-3.5759540 0.9454084
Standard deviation of the residuals: 0.05232406
Parameters of node TS2 (Gaussian distribution)
Conditional density: TS2 | Temp
Coefficients:
(Intercept) Temp
3.368793 1.044925
Standard deviation of the residuals: 0.05783186
Parameters of node TS3 (Gaussian distribution)
Conditional density: TS3 | Temp
Coefficients:
(Intercept) Temp
0.08259935 1.01507008
Standard deviation of the residuals: 0.05617952
# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()
Now it seems that we got a nice fit: the density graph represents the sensors as scalar transformations of the latent Temp variable.
If we go back to the plot showing the evolution of the temperature along the year, we can see that there could be different distributions for different periods of time. If we can obtain this discriminative information we can do a better job of modelling the domain.
We will start by discriminating the distribution by month:
dataByMonth <- dataRaw %>%
mutate(Month = as.factor(month(Date, label = TRUE, abbr = TRUE)))
dataByMonth %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) +
geom_density(alpha=0.5) +
facet_wrap(~Month)
It really seems that the month captures additional information about the distribution. In this case, we would obtain a better fit to the data if we condition Temp on Month as a discrete parent. However, we would then need parameters for 12 different Gaussian distributions, which may be a bit overkill.
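We can make the "overkill" concrete with a back-of-the-envelope parameter count for the Temp node: a conditional Gaussian node needs one (mean, sd) pair per configuration of its discrete parent (this simple count is an assumption for illustration; it ignores the parent's own multinomial parameters):

```r
# Free parameters of a conditional Gaussian node: one (mean, sd) pair
# per configuration of the discrete parent
params.temp <- function(n.parent.states) 2 * n.parent.states
params.temp(12)  # Month as parent: 24 parameters
params.temp(4)   # Season as parent: 8 parameters
```

Fewer parameters means less data needed per configuration and a lower BIC penalty.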
Let's see if we can summarise the month information by creating a lower-dimensional variable. If we think about the weather domain we rapidly come up with the year's seasons. Let's try to discriminate the distributions by season.
# Lubridate can't extract seasons so we need a custom function
# We'd better encode the states as a dict to avoid magic strings
SeasonStates <- list(
Winter = "winter",
Spring = "spring",
Summer = "summer",
Fall = "fall"
)
season <- function(date){
WS <- as.Date("2012-12-15", format = "%Y-%m-%d") # Winter Solstice
SE <- as.Date("2012-3-15", format = "%Y-%m-%d") # Spring Equinox
SS <- as.Date("2012-6-15", format = "%Y-%m-%d") # Summer Solstice
FE <- as.Date("2012-9-15", format = "%Y-%m-%d") # Fall Equinox
# Convert dates from any year to 2012 dates
d <- as.Date(strftime(date, format="2012-%m-%d"))
factor(
ifelse (d >= WS | d < SE, SeasonStates$Winter,
ifelse (d >= SE & d < SS, SeasonStates$Spring,
ifelse (d >= SS & d < FE, SeasonStates$Summer, SeasonStates$Fall))),
levels = SeasonStates
)
}
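The key trick in the helper above is mapping every date onto a single reference year so the season cutoffs can be compared regardless of the year (2012 is presumably chosen because it is a leap year, so Feb 29 maps cleanly). The trick can be tested in isolation:

```r
# Map any date onto the reference year 2012, keeping month and day
to2012 <- function(date) as.Date(strftime(date, format = "2012-%m-%d"))
to2012(as.Date("1999-07-04"))  # 2012-07-04
# Dates from different years now share a common scale for the season cutoffs
to2012(as.Date("2016-03-20")) >= as.Date("2012-03-15")  # TRUE: past the spring cutoff
```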
dataBySeason <- dataByMonth %>%
mutate(Season = season(Date))
dataBySeason %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) +
geom_density(alpha=0.5) +
facet_wrap(~Season)
The distributions under the season discrimination seem less Gaussian, but they encode the same pattern. Let's model both networks and see how well they fit the parameters.
First we will model the Month-based one:
dataLatentMonth <- dataByMonth %>%
mutate(Temp = NA) %>%
select(Month, Temp, TS1, TS2, TS3) %>%
as.data.frame()
dataLatentMonth$Temp <- as.numeric(dataLatentMonth$Temp)
And now we model the network DAG
heatAlarmDagMonth <- model2network("[Month][Temp|Month][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDagMonth)
Let's repeat the EM example and induce the parameters of the hidden variable using the previously successful configuration.
dataImputedMonth <- dataLatentMonth %>%
rowwise() %>%
mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
as.data.frame()
heatAlarmModelMonth <- parametric.em(heatAlarmDagMonth, dataLatentMonth, dataImputedMonth, iter = 10)
dataImputeMonthPosterior <- impute(heatAlarmModelMonth, data = dataLatentMonth, method = "bayes-lw")
dataImputeMonthPosterior %>%
gather("Sensor", "Measure", Temp, TS1, TS2, TS3) %>%
ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) +
geom_density(alpha=0.5) +
facet_wrap(~Month)
This is a very good fit indeed. We can see that Temp is almost identical to TS3, and that the other two sensors are just transformations of this variable. Let's check whether Season is a good simplification of the Month variable.
dataLatentSeason <- dataBySeason %>%
mutate(Temp = NA) %>%
select(Season, Temp, TS1, TS2, TS3) %>%
as.data.frame()
dataLatentSeason$Temp <- as.numeric(dataLatentSeason$Temp)
heatAlarmDagSeason <- model2network("[Season][Temp|Season][TS1|Temp][TS2|Temp][TS3|Temp]")
dataImputedSeason <- dataLatentSeason %>%
rowwise() %>%
mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
as.data.frame()
With that configuration we run the EM:
heatAlarmModelSeason <- parametric.em(heatAlarmDagSeason, dataLatentSeason, dataImputedSeason, iter = 10)
dataImputeSeasonPosterior <- impute(heatAlarmModelSeason, data = dataLatentSeason, method = "bayes-lw")
dataImputeSeasonPosterior %>%
gather("Sensor", "Measure", Temp, TS1, TS2, TS3) %>%
ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) +
geom_density(alpha=0.5) +
facet_wrap(~Season)
Another good fit. Which one of the two approaches is better? Our initial guess should be the Season one, as it has fewer parameters than the Month-based model. Let's ask the BIC score.
print(BIC(heatAlarmModelSeason, dataImputeSeasonPosterior))
[1] -1004.384
print(BIC(heatAlarmModelMonth, dataImputeMonthPosterior))
[1] -4853.787
As expected, Season achieves a much better (higher) BIC score than Month and should be our pick.
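Note that bnlearn reports a rescaled BIC in which higher values are better: logLik - (k / 2) * log(n), where k is the number of parameters. A minimal base-R sketch of that penalty, using a toy single-Gaussian model (the choice of k = 2, a mean and a standard deviation, belongs to the toy model, not to our networks):

```r
# bnlearn-style BIC (higher is better): logLik - (k / 2) * log(n).
# Toy illustration: a single Gaussian fitted to 100 points, so k = 2.
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
loglik <- sum(dnorm(x, mean = mean(x), sd = sd(x), log = TRUE))
k <- 2
bic <- loglik - (k / 2) * log(length(x))
# The penalty grows with k: a model with more configurations (12 Months
# instead of 4 Seasons) needs a much better likelihood to compensate.
```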
One of the most appealing aspects of Bayesian Networks is the ability to include expert knowledge in the model. While this is difficult or impossible in many other frameworks, it is a natural concept here.
We call BNs open box models because we can understand and interpret their parameters. This is really useful for modelling anomalies, because we can use latent variables and expert knowledge to represent events that cannot be observed.
If the events cannot be observed, we have no data to learn from. In our example we would have to destroy machines and set the room on fire to capture data about anomalies, and that would surely come at a high cost. Imagine other scenarios such as rare diseases, autonomous driving or financial fraud.
We will start by modeling a global fault by adding a new variable Alarm with two states, yes and no. In this example we have been collecting data from the sensors for a year; we assume that the system has functioned correctly and that there have been no recorded failures (perhaps because the expert technicians that maintain the room have assured us so). Given this, the Alarm variable will always be observed as no.
# New Variables
AlarmStates <- list(
No = "no",
Yes = "yes"
)
dataLatent <- dataLatentSeason %>%
mutate(Alarm = factor(AlarmStates$No, levels = AlarmStates)) %>%
select(Alarm, Season, Temp, TS1, TS2, TS3) %>%
as.data.frame()
And now the new dag:
heatAlarmDag <- model2network("[Alarm][Season][Temp|Season:Alarm][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag)
Let's run the EM algorithm one more time; there should be no changes to the parameters, but it will reinforce our logical pipeline.
dataImputed <- dataLatent %>%
rowwise() %>%
mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
as.data.frame()
heatAlarmModel <- parametric.em(heatAlarmDag, dataLatent, dataImputed, iter = 10)
Let's check the result on the Alarm variable:
heatAlarmModel$Alarm
Parameters of node Alarm (multinomial distribution)
Conditional probability table:
no yes
1 0
Obviously the algorithm fits the parameters to the observed population, in which no alarm has ever occurred. Here is where the expert knowledge enters: we contact our experts and design a proper parametrization of the variable.
“The system has a random failure rate of 1/1000; in such cases, the room temperature quickly rises by about 10 degrees.”
We have enough information to include this expert knowledge into the Alarm variable. Notice that this variable introduces additional parameters into the Temp variable that should be modelled too.
cptAlarm <- coef(heatAlarmModel$Alarm)
print(cptAlarm)
no yes
1 0
cptAlarm[] <- c(0.999, 0.001)
heatAlarmModel$Alarm <- cptAlarm
print(heatAlarmModel$Alarm)
Parameters of node Alarm (multinomial distribution)
Conditional probability table:
no yes
0.999 0.001
Now for the Temp variable, which is a little more complex:
cgaussTemp <- coef(heatAlarmModel$Temp)
sdTemp <- heatAlarmModel$Temp$sd
print(cgaussTemp)
0 1 2 3 4 5 6 7
(Intercept) 16.68242 NaN 21.16419 NaN 28.02095 NaN 21.01119 NaN
print(sdTemp)
0 1 2 3 4 5 6 7
3.428538 NaN 3.774607 NaN 2.444216 NaN 4.371588 NaN
# The "yes" configurations were never observed, so EM left them as NaN.
# Reuse the "no" parameters: shift the means by +10 degrees, keep the same sd.
cgaussTemp[is.nan(cgaussTemp)] <- cgaussTemp[!is.nan(cgaussTemp)] + 10
sdTemp[is.nan(sdTemp)] <- sdTemp[!is.nan(sdTemp)]
heatAlarmModel$Temp <- list( coef = cgaussTemp, sd = sdTemp)
heatAlarmModel$Temp
Parameters of node Temp (conditional Gaussian distribution)
Conditional density: Temp | Alarm + Season
Coefficients:
0 1 2 3 4 5 6 7
(Intercept) 16.68242 26.68242 21.16419 31.16419 28.02095 38.02095 21.01119 31.01119
Standard deviation of the residuals:
0 1 2 3 4 5 6 7
3.428538 3.428538 3.774607 3.774607 2.444216 2.444216 4.371588 4.371588
Discrete parents' configurations:
Alarm Season
0 no winter
1 yes winter
2 no spring
3 yes spring
4 no summer
5 yes summer
6 no fall
7 yes fall
We now have a complete model, with the anomaly detection schema built in. Let's test some queries.
e <- list( "Season" = SeasonStates$Winter, "TS1" = 23, "TS2" = 33, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(query)
[1] 0.3232407
This temperature is too high for winter, so the alarm is likely to go off.
e <- list( "Season" = SeasonStates$Summer, "TS1" = 23, "TS2" = 33, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(query)
[1] 0
However, this is a good fit for the higher temperatures registered in summer.
We could model additional anomalies for the individual malfunctioning of each sensor. In that case we could observe either global temperature anomalies or individual malfunctions.
A common pattern is to add a new hidden discrete variable as a parent of each sensor, switching the sensor to a random distribution in the case of a malfunction.
# New Variables
TS1FaultStates <- list(
No = "no",
Yes = "yes"
)
dataLatent <- dataLatentSeason %>%
mutate(
Alarm = factor(AlarmStates$No, levels = AlarmStates),
TS1Fault = factor(TS1FaultStates$No, levels = TS1FaultStates)
) %>%
select(Alarm, Season, Temp, TS1Fault, TS1, TS2, TS3) %>%
as.data.frame()
dataImputed <- dataLatent %>%
rowwise() %>%
mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
as.data.frame()
heatAlarmDag <- model2network("[Alarm][Season][TS1Fault][Temp|Season:Alarm][TS1|Temp:TS1Fault][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag)
We launch EM and modify the probabilities of the new nodes (now Alarm and TS1Fault). For that we need additional expert knowledge from the domain. We ask our IoT team about the fault tolerance of our sensors. They tell us the following:
“Our sensors are the best, they usually do not fail. The testing guys estimate just a 1/1000 failure rate.”
This is good: with such a marginal error probability, our new node will only detect anomalies without introducing noise into the rest of the propagation in the model.
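The effect of that marginal prior can be sanity-checked with a quick Bayes-rule computation (the 100x likelihood ratio below is a made-up number for illustration only):

```r
# With a 1/1000 prior on a fault, even a reading that is 100 times more
# likely under "fault" than under "ok" yields a moderate posterior.
prior_fault <- 0.001
lik_ratio <- 100  # P(reading | fault) / P(reading | ok); assumed value
post <- (prior_fault * lik_ratio) /
  (prior_fault * lik_ratio + (1 - prior_fault))
post  # ~ 0.09: the node stays quiet unless the evidence is strong
```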
heatAlarmModel <- parametric.em(heatAlarmDag, dataLatent, dataImputed, iter = 10)
# TODO: Please don't do this, this is not DRY...
cptAlarm[] <- c(0.999, 0.001)
heatAlarmModel$Alarm <- cptAlarm
cgaussTemp <- coef(heatAlarmModel$Temp)
sdTemp <- heatAlarmModel$Temp$sd
# As before, copy the "no" parameters into the unobserved "yes"
# configurations, shifting the means by +10 degrees.
cgaussTemp[is.nan(cgaussTemp)] <- cgaussTemp[!is.nan(cgaussTemp)] + 10
sdTemp[is.nan(sdTemp)] <- sdTemp[!is.nan(sdTemp)]
heatAlarmModel$Temp <- list( coef = cgaussTemp, sd = sdTemp)
cptTS1Fault <- coef(heatAlarmModel$TS1Fault)
print(cptTS1Fault)
no yes
1 0
cptTS1Fault[] <- c(0.9999, 0.0001)
heatAlarmModel$TS1Fault <- cptTS1Fault
print(heatAlarmModel$TS1Fault)
Parameters of node TS1Fault (multinomial distribution)
Conditional probability table:
no yes
0.9999 0.0001
And now for the TS1 node, in which we have to add a random distribution. We can use a Gaussian with a very large variance.
cgaussTS1 <- coef(heatAlarmModel$TS1)
sdTS1 <- heatAlarmModel$TS1$sd
print(cgaussTS1)
0 1
(Intercept) -3.7579690 NaN
Temp 0.9554187 NaN
print(sdTS1)
0 1
0.1100198 NaN
# Under a fault, TS1 becomes unrelated to Temp: zero coefficients and a
# huge residual sd make the distribution nearly flat, i.e. effectively random.
cgaussTS1[is.nan(cgaussTS1)] <- 0
sdTS1[is.nan(sdTS1)] <- 1000000
heatAlarmModel$TS1 <- list( coef = cgaussTS1, sd = sdTS1)
heatAlarmModel$TS1
Parameters of node TS1 (conditional Gaussian distribution)
Conditional density: TS1 | Temp + TS1Fault
Coefficients:
0 1
(Intercept) -3.7579690 0.0000000
Temp 0.9554187 0.0000000
Standard deviation of the residuals:
0 1
1.100198e-01 1.000000e+06
Discrete parents' configurations:
TS1Fault
0 no
1 yes
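Why does a huge residual standard deviation act as a "random" distribution? Over the whole range of plausible temperatures the density is essentially flat, so under the fault configuration TS1 carries almost no information about the real temperature. A quick base-R check:

```r
# Density ratio of two very different readings under N(0, 1e6):
flat <- function(x) dnorm(x, mean = 0, sd = 1e6)
ratio <- flat(20) / flat(100)
ratio  # ~ 1: both readings are (nearly) equally likely under a fault
```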
Let's test some inference cases in which the TS1 sensor is malfunctioning or reporting anomalous measures.
If TS1 is the only sensor capturing a very high temperature, the fault node will detect the anomaly:
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 24, "TS3" = 20 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(query)
[1] 0.1508417
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 24, "TS3" = 20 )
query <- cpquery(heatAlarmModel, event = TS1Fault == "yes", evidence = e, method = "lw", n = 100000)
print(query)
[1] 0.4746715
If the three sensors capture the same anomaly, then the TS1Fault node will assume that the problem is outside the model:
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 37, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(query)
[1] 0.8
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 37, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = TS1Fault == "yes", evidence = e, method = "lw", n = 100000)
print(query)
[1] 0
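The repeated query pattern above can be wrapped in a small helper. This is a hypothetical convenience function, not part of bnlearn; note that cpquery() captures the event expression literally, so we keep the event fixed inside the function and only vary the evidence list:

```r
library(bnlearn)

# Hypothetical helper: estimated probability of a global alarm given the
# season and the three sensor readings. Averaging several likelihood
# weighting runs reduces the Monte Carlo variance of the estimate.
alarmProbability <- function(model, season, ts1, ts2, ts3, n = 100000) {
  e <- list(Season = season, TS1 = ts1, TS2 = ts2, TS3 = ts3)
  mean(replicate(10, cpquery(model, event = Alarm == "yes",
                             evidence = e, method = "lw", n = n)))
}

# e.g. alarmProbability(heatAlarmModel, SeasonStates$Winter, 23, 33, 29)
```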
This is a variation on the previous problem. Can you spot the differences introduced by this new version?
dataRaw <- read_csv("./data/heatAlarm-lvl2.csv")
Parsed with column specification:
cols(
Date = col_date(format = ""),
Hour = col_character(),
TSO = col_double(),
TS1 = col_double(),
TS2 = col_double(),
TS3 = col_double()
)
It seems that the measures are somehow dependent, but the relationship is not as clear as in the previous phase.
dataRaw %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
geom_line()
dataRaw %>%
gather("Sensor", "Measure", TS1, TS2, TS3) %>%
ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
geom_smooth()
Our experts tell us that the indoor temperature of the room is affected by the outdoor weather, especially near the windows and doors. The temperature in the interior part of the room is more stable.
Given this new knowledge we decide to place a new sensor outdoors, TSO. Now we have an additional component to add to our model; hopefully it will allow us to model the mixture of indoor and outdoor temperatures.
dataRaw %>%
gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
geom_smooth()
dataByMonth <- dataRaw %>%
mutate(Month = as.factor(month(Date, label = TRUE, abbr = TRUE)))
dataByMonth %>%
gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) +
geom_density(alpha=0.5) +
facet_wrap(~Month)
dataBySeason <- dataByMonth %>%
mutate(Season = season(Date))
dataBySeason %>%
gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) +
geom_density(alpha=0.4) +
facet_wrap(~Season)
To add the new sensor to the model we are going to need an additional latent variable. Our model will now have two latent temperature variables, TInd and TOut (indoor and outdoor). The indoor temperature will depend on the outdoor temperature. Let's build it:
heatAlarmDag2 <- model2network("[Alarm][Season][TOut|Season][TSO|TOut][TInd|TOut:Alarm][TS1|TInd][TS2|TInd][TS3|TInd]")
plotD3bn(heatAlarmDag2)
This model is the base example of the new domain. However, we may have some doubts about the relationship between TOut and the sensors: look closely at the distributions and you will see that not all of them are affected equally by the outdoor temperature. We ask again…
IoT team: Uh, yes. We installed TS1 in a corner of the room, near the entrance. TS2 is on the ceiling and might be near a ventilation shaft, so it will probably capture a bit more of the heat from the machines. TS3 should be more stable, it is located in the middle of the room.
This almost makes sense visually. However, at this point we need to test the different structures, either by learning all of them and comparing their BIC scores, or by running a structural learning process.
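A sketch of the comparison, assuming bnlearn is loaded and a fully observed data frame (here called dataImputed2, obtained by imputing the latent variables as before) is available; the candidate DAG strings and the data frame name are assumptions, not part of the tutorial so far:

```r
# Candidate structures differing in which sensors get TOut as an extra parent
# (TS1 sits near the entrance, so it is the prime candidate).
candidates <- list(
  base    = "[Season][TOut|Season][TSO|TOut][TInd|TOut][TS1|TInd][TS2|TInd][TS3|TInd]",
  ts1door = "[Season][TOut|Season][TSO|TOut][TInd|TOut][TS1|TInd:TOut][TS2|TInd][TS3|TInd]"
)
# Score each candidate on the imputed data with the conditional Gaussian BIC:
# sapply(candidates, function(s) score(model2network(s), dataImputed2, type = "bic-cg"))
# Or let a structural learning algorithm search the space for us:
# hc(dataImputed2, score = "bic-cg")
```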
This tutorial is licensed under the GPL3 license and is originally published and maintained in this github repository by Jacinto Arias.