Introduction

In this tutorial we will learn some of the basics of Bayesian Networks by solving a simulated problem. Our agenda will include:

  • A quick background review of the packages we will be using (tidyverse and bnlearn)
  • A statistical approach to the problem at hand
  • Modeling with expert knowledge
  • Learning parameters from data with latent variables
  • Probability queries and simulation
  • Advanced use cases

Seed

set.seed(1020)

Aside: The tidyverse

You should get to know the tidyverse; it makes R easier and more fun.

You can find more information at https://www.tidyverse.org

I really suggest you take Hadley Wickham's R for Data Science course: http://r4ds.had.co.nz

In this tutorial I will be using the following tidyverse utilities:

  • dplyr, in general
  • ggplot2, for almost any plot
  • tidyr, specifically the gather function
  • lubridate, to extract components from dates
  • readr, to read data from CSV files

In order to follow the tutorial you should install and load the following packages:

library(tidyverse)
library(lubridate)

The bnlearn package

The bnlearn package is the most complete and popular package for Bayesian Networks available to date in R. We will start our tutorial by reviewing some of its capabilities.

First we should install and load the package:

library(bnlearn)

*Please make sure that you either download or compile at least bnlearn 4.2.x, as previous versions might have some bugs.

You should check out the documentation that is available online [Link]. It contains nice examples and additional use cases beyond this tutorial. If you are interested in this topic there are a couple of related books written by the author of the package; you can find the references on the mentioned website.

Data Structures

The usage of bnlearn revolves around two main data structures that represent a Bayesian Network at different stages (note that these are S3 classes whose names might overlap with other functions):

  • bn [Docs]. Represents the structural information: variables, graph, and the learning algorithm, if provided.
  • bn.fit [Docs]. Adds the parametric information on top of the previous structure. It contains the distribution of each node according to its type and parent configuration.
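As a quick illustration of the two classes (a minimal sketch on simulated discrete data; the exact subtype names attached to fitted objects are bnlearn implementation details):

```r
library(bnlearn)

# A bn object holds only the graph structure
dag <- empty.graph(LETTERS[1:3])
class(dag)  # "bn"

# Add one arc, then fit parameters on simulated discrete data
arcs(dag) <- matrix(c("A", "B"), ncol = 2,
                    dimnames = list(NULL, c("from", "to")))
d <- data.frame(
  A = factor(sample(c("yes", "no"), 100, replace = TRUE)),
  B = factor(sample(c("yes", "no"), 100, replace = TRUE)),
  C = factor(sample(c("yes", "no"), 100, replace = TRUE))
)
fit <- bn.fit(dag, d)
class(fit)  # "bn.fit" plus a discrete-network subtype
```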

Creating the structure of Bayesian networks

There are different ways to manually initialize and modify the graph of a Bayesian Network.

We can create empty graphs from a set of variables:

vars <- LETTERS[1:6]
dag  <- empty.graph(vars)
dag

  Random/Generated Bayesian network

  model:
   [A][B][C][D][E][F] 
  nodes:                                 6 
  arcs:                                  0 
    undirected arcs:                     0 
    directed arcs:                       0 
  average markov blanket size:           0.00 
  average neighbourhood size:            0.00 
  average branching factor:              0.00 

  generation algorithm:                  Empty 

You can specify arc sets in various ways. The most common are:

As a two-column (from, to) matrix, assigned via arcs to an existing network:

e <- matrix(
      c("A", "C", "B", "F", "C", "F"),
      ncol = 2, byrow = TRUE,
      dimnames = list(NULL, c("from", "to"))
    )
arcs(dag) <- e
dag

  Random/Generated Bayesian network

  model:
   [A][B][D][E][C|A][F|B:C] 
  nodes:                                 6 
  arcs:                                  3 
    undirected arcs:                     0 
    directed arcs:                       3 
  average markov blanket size:           1.33 
  average neighbourhood size:            1.00 
  average branching factor:              0.50 

  generation algorithm:                  Empty 

We can also use an adjacency matrix and assign it to a dag with amat:

adj <- matrix(
        0L, 
        ncol = 6, 
        nrow = 6,
        dimnames = list(vars, vars)
       )
adj["A", "C"] = 1L
adj["B", "F"] = 1L
adj["C", "F"] = 1L
adj["D", "E"] = 1L
adj["A", "E"] = 1L
print(adj)
  A B C D E F
A 0 0 1 0 1 0
B 0 0 0 0 0 1
C 0 0 0 0 0 1
D 0 0 0 0 1 0
E 0 0 0 0 0 0
F 0 0 0 0 0 0
amat(dag) <- adj
dag

  Random/Generated Bayesian network

  model:
   [A][B][D][C|A][E|A:D][F|B:C] 
  nodes:                                 6 
  arcs:                                  5 
    undirected arcs:                     0 
    directed arcs:                       5 
  average markov blanket size:           2.33 
  average neighbourhood size:            1.67 
  average branching factor:              0.83 

  generation algorithm:                  Empty 

The last option is to create a formula for a given set of variables. Each node is specified by a pair of brackets [<var_name>]. If the node has a parent set, we denote it with | and specify the list of parents separated by colons :. We can convert the formula into a bn object with model2network.

dag <- model2network("[A][C][B|A][D|C][F|A:B:C][E|F]")
dag

  Random/Generated Bayesian network

  model:
   [A][C][B|A][D|C][F|A:B:C][E|F] 
  nodes:                                 6 
  arcs:                                  6 
    undirected arcs:                     0 
    directed arcs:                       6 
  average markov blanket size:           2.67 
  average neighbourhood size:            2.00 
  average branching factor:              1.00 

  generation algorithm:                  Empty 

The package provides other capabilities, such as random graph generation and graph sampling. In addition, there is a range of utilities to manipulate graphs [Link] [Link].
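For instance, random DAGs can be drawn with random.graph (a small sketch; see the bnlearn documentation for the available generation methods):

```r
library(bnlearn)

set.seed(1020)
# Draw a single random DAG over six variables
rnd <- random.graph(LETTERS[1:6])
rnd

# Draw several graphs at once; the result is then a list of bn objects
rnds <- random.graph(LETTERS[1:6], num = 3)
length(rnds)  # 3
```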

Plotting graphs

We can plot graphs with the built-in R engine by calling plot on the bn class:

plot(dag)

Minimal aspects of the plot can be customized as documented in the corresponding help page. Other packages can be used indirectly to plot graphs; bnlearn provides connections with some of them, but be aware that some might be outdated.

Fortunately, graphs are a common data structure and we can find lots of utilities to work with them. One of the most common visualization tools for graphs, and for data in general, is the D3 library from the JavaScript world. Thanks to the integration of R Shiny with web technologies we can find wonderful ports such as the networkD3 package.

The next snippet is a custom function that transforms a bn object into the information required to plot a D3 force graph.

Please install networkD3, and be careful: this package is currently in beta and future versions could break this code.

plotD3bn <- function(bn) {
  varNames <- nodes(bn)
  # Nodes should be zero indexed!
  links <- data.frame(arcs(bn)) %>%
    mutate(from = match(from, varNames)-1, to = match(to, varNames)-1, value = 1)
  
  nodes <- data.frame(name = varNames) %>%
    mutate(group = 1, size = 30)
  
  networkD3::forceNetwork(
    Links = links,  
    Nodes = nodes,
    Source = "from",
    Target = "to",
    Value = "value",
    NodeID = "name",
    Group = "group",
    fontSize = 20,
    zoom = TRUE,
    arrows = TRUE,
    bounded = TRUE,
    opacityNoHover = 1
  )
}

We can now plot our last generated BN. If the result looks too small you can zoom in and out using your mouse wheel.

plotD3bn(dag)

Loading Bayesian networks from files

There are different file formats to represent a Bayesian network. They have originated over the years as efforts to create standards or as parts of particular proprietary systems. Despite these efforts there is no agreed-upon standard; perhaps the closest one is the XBIF format [Link].

bnlearn provides several ways to load BNs in different formats [Docs]. You should use them when importing from or exporting to other systems, or to share networks with other people; if you plan to use only bnlearn you can simply save an rda file.
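A minimal sketch of both options, assuming a fitted discrete network such as the asia model loaded in the next section (the file names are placeholders; write.bif and read.bif handle the BIF interchange format and work only for discrete networks):

```r
library(bnlearn)

# Native R serialization: simplest if only bnlearn will read the model back
saveRDS(asia, file = "asia-model.rds")
asiaFromRds <- readRDS("asia-model.rds")

# BIF interchange format, readable by other Bayesian network tools
write.bif("asia-model.bif", asia)
asiaFromBif <- read.bif("asia-model.bif")
```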

The Bayes net repository

The maintainers of bnlearn also provide a modern, R-focused repository of popular Bayesian networks that have been used extensively for benchmarking in the literature.

http://www.bnlearn.com/bnrepository/

Here you can find networks with different properties that can be used to test algorithms or to explore this and other BN packages. We will now work with the popular asia network: Asia is for BNs what “iris” is for statistics.

# This downloads the RData file from the repository and loads it.
# The bn is loaded into a bn.fit variable called bn
load(url("http://www.bnlearn.com/bnrepository/asia/asia.rda"))
asia <- bn

Discrete networks

We will now explore the contents of this variable. As you can see, it contains not only the structure but the parameters as well. First we will take a look at the structure:

bn.net(asia)

  Random/Generated Bayesian network

  model:
   [asia][smoke][tub|asia][lung|smoke][bronc|smoke][either|tub:lung][xray|either]
   [dysp|bronc:either]
  nodes:                                 8 
  arcs:                                  8 
    undirected arcs:                     0 
    directed arcs:                       8 
  average markov blanket size:           2.50 
  average neighbourhood size:            2.00 
  average branching factor:              1.00 

  generation algorithm:                  Empty 
plotD3bn(asia)

Now it is time to review the parameters; this prints each node and its associated probability table. In this case all variables are discrete, so the tables are conditional probability tables.

asia

  Bayesian network parameters

  Parameters of node asia (multinomial distribution)

Conditional probability table:
  yes   no 
0.01 0.99 

  Parameters of node tub (multinomial distribution)

Conditional probability table:
 
     asia
tub    yes   no
  yes 0.05 0.01
  no  0.95 0.99

  Parameters of node smoke (multinomial distribution)

Conditional probability table:
 yes  no 
0.5 0.5 

  Parameters of node lung (multinomial distribution)

Conditional probability table:
 
     smoke
lung   yes   no
  yes 0.10 0.01
  no  0.90 0.99

  Parameters of node bronc (multinomial distribution)

Conditional probability table:
 
     smoke
bronc yes  no
  yes 0.6 0.3
  no  0.4 0.7

  Parameters of node either (multinomial distribution)

Conditional probability table:
 
, , tub = yes

      lung
either yes no
   yes   1  1
   no    0  0

, , tub = no

      lung
either yes no
   yes   1  0
   no    0  1


  Parameters of node xray (multinomial distribution)

Conditional probability table:
 
     either
xray   yes   no
  yes 0.98 0.05
  no  0.02 0.95

  Parameters of node dysp (multinomial distribution)

Conditional probability table:
 
, , either = yes

     bronc
dysp  yes  no
  yes 0.9 0.7
  no  0.1 0.3

, , either = no

     bronc
dysp  yes  no
  yes 0.8 0.1
  no  0.2 0.9

We can access individual nodes of the net just like columns of a data.frame:

asia$smoke

  Parameters of node smoke (multinomial distribution)

Conditional probability table:
 yes  no 
0.5 0.5 

There is also a function to plot the distributions of discrete networks:

bn.fit.barchart(asia$smoke)

bn.fit.barchart(asia$dysp)

Introducing expert knowledge

We can alter the probability tables of a BN; this is useful to override parameters learnt from data or to set distributions for unobserved variables. This method allows us to include expert information from the domain of the modelled problem.

To modify a conditional probability table you can directly replace the existing table in the model after extracting it with coef. Be careful to maintain the inherent restrictions of the probability distribution (non-negative entries, with each conditional distribution summing to 1).

cpt <- coef(asia$smoke)
cpt[] <- c(0.2, 0.8)
asia$smoke <- cpt
asia$smoke

  Parameters of node smoke (multinomial distribution)

Conditional probability table:
 yes  no 
0.2 0.8 

Sampling data from a Bayesian network

bnlearn provides an R-like function, rbn, to sample data from a fitted model. We will now sample a dataset from asia to test learning from data.

# Note that the order of the parameters is inverted from the R sample functions
sampleAsia <- rbn(x = asia, n = 10000)
head(sampleAsia)

Parametric learning from data

We can induce the parameters of a Bayesian Network from observed data. bnlearn provides different algorithms for this; Maximum Likelihood Estimation (MLE) is the most common one.

We can invoke the learning algorithm with the function bn.fit. For that we need a DAG and a compatible dataset:

net <- bn.net(asia)
asiaInduced <- bn.fit(x = net, data = sampleAsia)

We can now compare the two networks; there should be some discrepancies in the induced one. Notice that extremely marginal probabilities will rarely be simulated and thus will not have a significant presence in the sample.

asiaInduced

  Bayesian network parameters

  Parameters of node asia (multinomial distribution)

Conditional probability table:
    yes     no 
0.0103 0.9897 

  Parameters of node tub (multinomial distribution)

Conditional probability table:
 
     asia
tub          yes         no
  yes 0.04854369 0.01040719
  no  0.95145631 0.98959281

  Parameters of node smoke (multinomial distribution)

Conditional probability table:
    yes     no 
0.2008 0.7992 

  Parameters of node lung (multinomial distribution)

Conditional probability table:
 
     smoke
lung         yes         no
  yes 0.09711155 0.01051051
  no  0.90288845 0.98948949

  Parameters of node bronc (multinomial distribution)

Conditional probability table:
 
     smoke
bronc       yes        no
  yes 0.6000996 0.2936687
  no  0.3999004 0.7063313

  Parameters of node either (multinomial distribution)

Conditional probability table:
 
, , lung = yes

      tub
either yes no
   yes   1  1
   no    0  0

, , lung = no

      tub
either yes no
   yes   1  0
   no    0  1


  Parameters of node xray (multinomial distribution)

Conditional probability table:
 
     either
xray         yes         no
  yes 0.96850394 0.05291610
  no  0.03149606 0.94708390

  Parameters of node dysp (multinomial distribution)

Conditional probability table:
 
, , either = yes

     bronc
dysp         yes         no
  yes 0.91847826 0.68020305
  no  0.08152174 0.31979695

, , either = no

     bronc
dysp         yes         no
  yes 0.80938242 0.09854423
  no  0.19061758 0.90145577

Structural learning

On many occasions the structure of the model is designed by hand. We usually know the variables of the problem and the different causal patterns from expert knowledge, and this descriptive information can be transferred to the model. This is what we call an open-box model, and it provides a powerful framework for many problems, especially when compared to models that do not offer a clear interpretation of their parameters.

However, there are many situations in which we would like to automate structural learning, such as causal pattern discovery or when we lack knowledge of the domain.

In other cases we just want to learn about particular dependency relationships between the variables or select the best structure among a particular set.

bnlearn specializes in structural learning, and for this reason it provides a range of algorithms to learn models. There is a complex taxonomy of such algorithms based on the statistical tests, metrics and heuristics they rely on. Exact learning is an NP-hard problem, and thus several approaches have been proposed.

We will briefly study the hc algorithm to learn a full structure and the BIC score to measure the fit of a particular network to a dataset.

The hc algorithm can simply be run on a sample of data, although many options can be set to optimize it for a particular problem.

networkInduced <- hc(x = sampleAsia)
networkInduced

  Bayesian network learned via Score-based methods

  model:
   [asia][tub][smoke][lung|smoke][bronc|smoke][either|tub:lung][xray|either][dysp|bronc:either] 
  nodes:                                 8 
  arcs:                                  7 
    undirected arcs:                     0 
    directed arcs:                       7 
  average markov blanket size:           2.25 
  average neighbourhood size:            1.75 
  average branching factor:              0.88 

  learning algorithm:                    Hill-Climbing 
  score:                                 BIC (disc.) 
  penalization coefficient:              4.60517 
  tests used in the learning procedure:  77 
  optimized:                             TRUE 

Let's compare it with the original network (the golden model), as the algorithm may have introduced some differences given that we used a small data sample.

modelstring(asia)
[1] "[asia][smoke][tub|asia][lung|smoke][bronc|smoke][either|lung:tub][xray|either][dysp|bronc:either]"

We can also compute some metrics to compare the networks. The structural Hamming distance measures the amount of discrepancy between two graphs.

shd(bn.net(asia), networkInduced)
[1] 1

The BIC score measures the fit of a structure to a given sample; it also penalizes the number of parameters to avoid overfitting. In this case the result is almost the same. Note that bnlearn reports BIC on a scale where higher values are better, so it seems that the induced model could be slightly biased towards the sample and marginally outperforms the golden model.

print(BIC(object = asia, data = sampleAsia))
[1] -19437.54
print(BIC(object = networkInduced, data = sampleAsia))
[1] -19432.09

Gaussian networks

Gaussian networks differ in the class of distributions that represent them. In a Gaussian network every node is continuous: root nodes are encoded by the parameters of a normal distribution (mean and sd), while nodes with parents are represented as a linear regression, with a coefficient for each parent, an intercept term and the standard deviation of the residuals.

bnlearn ships with some sample Gaussian data to test these BNs:

data(gaussian.test)
dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
model <- bn.fit(dag, gaussian.test)
model

  Bayesian network parameters

  Parameters of node A (Gaussian distribution)

Conditional density: A
Coefficients:
(Intercept)  
   1.007493  
Standard deviation of the residuals: 1.004233 

  Parameters of node B (Gaussian distribution)

Conditional density: B
Coefficients:
(Intercept)  
   2.039499  
Standard deviation of the residuals: 3.034111 

  Parameters of node C (Gaussian distribution)

Conditional density: C | A + B
Coefficients:
(Intercept)            A            B  
   2.001083     1.995901     1.999108  
Standard deviation of the residuals: 0.5089772 

  Parameters of node D (Gaussian distribution)

Conditional density: D | B
Coefficients:
(Intercept)            B  
   5.995036     1.498395  
Standard deviation of the residuals: 0.3286672 

  Parameters of node E (Gaussian distribution)

Conditional density: E
Coefficients:
(Intercept)  
   3.493906  
Standard deviation of the residuals: 1.98986 

  Parameters of node F (Gaussian distribution)

Conditional density: F | A + D + E + G
Coefficients:
 (Intercept)             A             D             E             G  
-0.006047321   1.994853041   1.005636909   1.002577002   1.494373265  
Standard deviation of the residuals: 0.9961475 

  Parameters of node G (Gaussian distribution)

Conditional density: G
Coefficients:
(Intercept)  
   5.028076  
Standard deviation of the residuals: 1.983631 

To modify a node of a Gaussian network we proceed as in the discrete case:

model$A <- list(coef = c("(Intercept)" = 10), sd = 0)
model$A

  Parameters of node A (Gaussian distribution)

Conditional density: A
Coefficients:
(Intercept)  
         10  
Standard deviation of the residuals: 0 

Hybrid networks

An hybrid network contains both discrete and continuous variables. There is a framework restriction in which a discrete variable cannot have any continuous parent. The parameters are modelled as in the previous cases.

A continuous variable with discrete parents is represented by a conditional Gaussian distribution: there is one linear Gaussian distribution (over any continuous parents) for each configuration of the discrete parents.

In the next example we will use custom.fit to manually load the parameters into a graph:

net <- model2network("[A][B][C|A:B]")
cptA  <- matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("LOW", "HIGH")))
distB <- list(coef = c("(Intercept)" = 1), sd = 1.5)
distC <- list(
  coef = 
    matrix(
      c(1.2, 2.3, 3.4, 4.5), 
      ncol = 2,
      dimnames = list(c("(Intercept)", "B"), NULL)
    ),
    sd = c(0.3, 0.6)
)
model = custom.fit(net, dist = list(A = cptA, B = distB, C = distC))
model

  Bayesian network parameters

  Parameters of node A (multinomial distribution)

Conditional probability table:
 
 LOW HIGH 
 0.4  0.6 

  Parameters of node B (Gaussian distribution)

Conditional density: B
Coefficients:
(Intercept)  
          1  
Standard deviation of the residuals: 1.5 

  Parameters of node C (conditional Gaussian distribution)

Conditional density: C | A + B
Coefficients:
               0    1
(Intercept)  1.2  3.4
B            2.3  4.5
Standard deviation of the residuals:
  0    1  
0.3  0.6  
Discrete parents' configurations:
      A
0   LOW
1  HIGH

Inference and probability queries

The inference engine of bnlearn is limited, but it can be used to test the networks and to perform basic operations with the models.

Using cpquery we can ask for the probability of an event given some evidence; both are boolean expressions involving the variables in the model. We may ask about a particular combination of variable configurations in the BN given a set of observed states for the evidence variables.

For example, in the asia network we could ask: what is the probability of a positive lung cancer diagnosis for a person who smokes?

(For cpquery I recommend the more powerful lw method, i.e. likelihood weighting.)

# First we should observe the prior probability to compare
# TRUE is for empty evidence
cpquery(asia, event = lung == "yes", evidence = TRUE)
[1] 0.0266
cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000)
[1] 0.10054

Notice that the method is not very stable, so it is better to average over a repeated sample of queries:

queries <- replicate(1000, cpquery(asia, event = lung == "yes", evidence = TRUE))
print(mean(queries))
[1] 0.0216
queries <- replicate(1000, cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000))
print(mean(queries))
[1] 0.0997

The other option is to use cpdist to sample cases from the network for a set of nodes given some evidence. The usage is the same, and we can obtain more stability by increasing the size of the sample.

s <- cpdist(asia, nodes = c("lung"), evidence = TRUE, n=1000)
head(s)

We can even summarise or plot the distribution:

summary(s)
  lung    
 yes: 21  
 no :979  
prop.table(table(s))
s
  yes    no 
0.021 0.979 
ggplot(s, aes(x=lung)) + geom_bar()


Modeling with Bayesian Networks

Now that we have a nice overview of working with bnlearn, we will switch to an applied problem in which we want to build a model for a real context.

The problem

We are part of a Data Science team at an IoT company that has developed a new temperature sensor to monitor data center rooms, in order to detect and predict refrigeration failures early. The sensors were installed a year ago, and we need to create a model able to detect specific heat outbreaks or global heating problems in the room.

We are provided with a dataset consisting of a number of measures per day during the previous year, for a total of 3 sensors that have been placed in strategic locations across the room.

We will start to work on the first version of the problem heatAlarm-lvl1.csv.

dataRaw <- read_csv("./data/heatAlarm-lvl1.csv")
Parsed with column specification:
cols(
  Date = col_date(format = ""),
  Hour = col_character(),
  TS1 = col_double(),
  TS2 = col_double(),
  TS3 = col_double()
)
head(dataRaw)

Exploratory analysis

We will start by studying the properties of our problem at hand. First we should look into the distribution of the dataset.

summary(dataRaw)
      Date                Hour                TS1              TS2             TS3       
 Min.   :2016-01-01   Length:1830        Min.   : 4.627   Min.   :12.44   Min.   : 8.89  
 1st Qu.:2016-04-01   Class :character   1st Qu.:12.745   1st Qu.:21.41   1st Qu.:17.61  
 Median :2016-07-01   Mode  :character   Median :16.979   Median :26.09   Median :22.15  
 Mean   :2016-07-01                      Mean   :17.011   Mean   :26.12   Mean   :22.19  
 3rd Qu.:2016-10-01                      3rd Qu.:21.384   3rd Qu.:30.96   3rd Qu.:26.88  
 Max.   :2016-12-31                      Max.   :31.276   Max.   :41.89   Max.   :37.50  

We see that we have a complete year of measures and different ranges for each sensor. Now it is interesting to check the completeness of the data; we would also like to see its distribution across dates.

dataRaw %>%
  group_by(Date) %>%
  summarise(Measures = n()) %>%
  ggplot(aes(x=Measures)) + geom_histogram()

We confirm that the number of measures per day is stable. We will now observe the distribution of the temperature measures for each sensor.

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_line()

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_smooth()

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_boxplot()

We observe three different distributions, one for each sensor. Visually there appears to be some linear variation; in fact, a simple additive relationship that could be modelled by adding a scalar intercept value. This dependency relationship can be modelled in a BN, but what is the best way to do it?
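The additive intuition can be sanity-checked with a small simulation (synthetic data, not the tutorial's dataset): if each sensor reads a common hidden temperature plus its own constant offset and some noise, regressing one sensor on another should recover a slope near 1 and an intercept near the offset difference.

```r
set.seed(1020)

# A hidden environmental temperature and two noisy sensors with different offsets
temp <- rnorm(1000, mean = 20, sd = 5)
ts1  <- temp + rnorm(1000, sd = 0.5)      # offset 0
ts2  <- temp + 9 + rnorm(1000, sd = 0.5)  # offset 9

# A slope close to 1 and an intercept close to 9 support the additive model
fit <- lm(ts2 ~ ts1)
coef(fit)
```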

Latent variables

The solution is elegant from both the computational and the statistical point of view. We can think of the sensors as measuring some real but unobservable temperature from the environment. This hidden value conditions the exact measures that the sensors capture, and thus can model their individual distributions.

In other words, our model requires an additional variable, more concretely a latent Temp variable that cannot be observed but that is captured by each sensor through a particular dependency. We will represent this by adding a new variable to our data (we will also remove unused variables).

dataLatent <- dataRaw %>% 
  mutate(Temp = NA) %>%
  select(Temp, TS1, TS2, TS3) %>%
  as.data.frame()
dataLatent$Temp <- as.numeric(dataLatent$Temp)

The next step is to estimate this variable to complete our model. We can learn the three sensor distributions from the data, but not the Temp variable. For that we will use the EM algorithm. First, let's build the model:

heatAlarmDag1 <- model2network("[Temp][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag1)

Now for the EM. Unfortunately, bnlearn does not provide a standalone implementation of this algorithm (it does for structural EM, which is a superset), but it provides the tools to do so. Basically, the parametric EM algorithm can be implemented by:

  • E step: impute the missing variables in the data using Bayesian estimation from the current model.
  • M step: learn new parameters for the model by maximizing the likelihood on the imputed data (MLE).

As you can imagine, these steps are repeated for a number of iterations or until convergence. To impute the latent variables we will use the function impute; however, we need an initial model to do so. This is another important element of the EM algorithm, as we need a good initial guess to ensure a proper fit.

Let's first try a random initialization of Temp:

# Let's add a uniform Temp
dataImputed <- dataLatent %>% rowwise() %>% mutate(Temp = runif(1, 10, 25))
dataImputed %>%
  ggplot(aes(x=Temp)) + geom_density()

The EM algorithm is then quite simple:

parametric.em <- function(dag, dataLatent, dataImputed, iter = 5) {
  fitted <- bn.fit(dag, dataImputed, method = "mle")
  for ( i in seq(iter)) {
   complete <- impute(fitted, data = dataLatent, method = "bayes-lw")
   fitted <- bn.fit(dag, complete, method = "mle")
  }
  fitted
}

Let’s try it:

heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)

Let's see the resulting model:

heatAlarmModel1

  Bayesian network parameters

  Parameters of node Temp (Gaussian distribution)

Conditional density: Temp
Coefficients:
(Intercept)  
   17.43173  
Standard deviation of the residuals: 0.3912403 

  Parameters of node TS1 (Gaussian distribution)

Conditional density: TS1 | Temp
Coefficients:
(Intercept)         Temp  
  247.33610    -13.21297  
Standard deviation of the residuals: 0.04914744 

  Parameters of node TS2 (Gaussian distribution)

Conditional density: TS2 | Temp
Coefficients:
(Intercept)         Temp  
  280.69264    -14.60381  
Standard deviation of the residuals: 0.05432086 

  Parameters of node TS3 (Gaussian distribution)

Conditional density: TS3 | Temp
Coefficients:
(Intercept)         Temp  
  269.48291    -14.18656  
Standard deviation of the residuals: 0.05276883 
# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()

Look at those intercept coefficients; they do not look right. As we can see in the plot it seems like a forced fit, where the Temp variable has a very different distribution. Let's try another guess at the prior for Temp; this time we will use a Gaussian distribution with statistics drawn from the sensors.

statistics <- dataLatent %>% 
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  summarise( 
    mu    = mean(Measure),
    sigma = sd(Measure)
  )
dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = rnorm(1, statistics$mu, statistics$sigma))
dataImputed %>%
  ggplot(aes(x=Temp)) + geom_density()

Let’s repeat the EM:

heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
heatAlarmModel1

  Bayesian network parameters

  Parameters of node Temp (Gaussian distribution)

Conditional density: Temp
Coefficients:
(Intercept)  
   21.75432  
Standard deviation of the residuals: 0.1244617 

  Parameters of node TS1 (Gaussian distribution)

Conditional density: TS1 | Temp
Coefficients:
(Intercept)         Temp  
 -886.55599     41.53506  
Standard deviation of the residuals: 0.04063577 

  Parameters of node TS2 (Gaussian distribution)

Conditional density: TS2 | Temp
Coefficients:
(Intercept)         Temp  
 -972.55651     45.90717  
Standard deviation of the residuals: 0.04491322 

  Parameters of node TS3 (Gaussian distribution)

Conditional density: TS3 | Temp
Coefficients:
(Intercept)         Temp  
 -947.95912     44.59554  
Standard deviation of the residuals: 0.04362999 
# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()

This time the fit is even worse, as we overfitted a bad prior. We will try to make a better guess, this time dependent on the actual values of the sensors.

dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3)))
dataImputed %>%
  ggplot(aes(x=Temp)) + geom_density()

heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
heatAlarmModel1

  Bayesian network parameters

  Parameters of node Temp (Gaussian distribution)

Conditional density: Temp
Coefficients:
(Intercept)  
   21.77576  
Standard deviation of the residuals: 5.46792 

  Parameters of node TS1 (Gaussian distribution)

Conditional density: TS1 | Temp
Coefficients:
(Intercept)         Temp  
 -3.5759540    0.9454084  
Standard deviation of the residuals: 0.05232406 

  Parameters of node TS2 (Gaussian distribution)

Conditional density: TS2 | Temp
Coefficients:
(Intercept)         Temp  
   3.368793     1.044925  
Standard deviation of the residuals: 0.05783186 

  Parameters of node TS3 (Gaussian distribution)

Conditional density: TS3 | Temp
Coefficients:
(Intercept)         Temp  
 0.08259935   1.01507008  
Standard deviation of the residuals: 0.05617952 
# Impute the latent variable and plot the resulting distributions
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()

Now it seems that we have a nice fit: the density plot shows the sensors as linear transformations of the latent Temp variable.

Modeling the domain

If we go back to the plot showing the evolution of the temperature over the year, we can see that there could be different distributions for different periods of time. If we can capture this discriminative information we can model the domain better.

We will start by discriminating the distribution by month:

dataByMonth <- dataRaw %>%
  mutate(Month = as.factor(month(Date, label = TRUE, abbr = TRUE)))
dataByMonth %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Month)

It really seems that the month can capture additional information about the distribution. We would obtain a better fit to the data if we added Month as a discrete parent of the Temp variable. However, we would then need parameters for 12 different Gaussian distributions, which may be a bit of an overkill.

Let's see whether we can summarise the month information by creating a lower-dimensional variable. If we think about the weather domain we quickly come up with the year's seasons, so let's try to discriminate the distributions by season.
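To make the parameter count concrete, here is a quick back-of-the-envelope count (ignoring the probability table of the discrete parent itself): each configuration of a discrete parent contributes an intercept and a residual standard deviation for Temp.

```r
# Parameters of Temp under each candidate discretisation (intercept + sd each)
paramsMonth  <- 12 * 2
paramsSeason <- 4 * 2
c(Month = paramsMonth, Season = paramsSeason)
```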

# lubridate can't extract seasons, so we need a custom function
# We encode the states in a list to avoid magic strings
SeasonStates <- list(
  Winter = "winter",
  Spring = "spring",
  Summer = "summer",
  Fall = "fall"
)
season <- function(date){
  WS <- as.Date("2012-12-15", format = "%Y-%m-%d") # Winter Solstice
  SE <- as.Date("2012-3-15",  format = "%Y-%m-%d") # Spring Equinox
  SS <- as.Date("2012-6-15",  format = "%Y-%m-%d") # Summer Solstice
  FE <- as.Date("2012-9-15",  format = "%Y-%m-%d") # Fall Equinox
  
  # Convert dates from any year to 2012 dates
  d <- as.Date(strftime(date, format="2012-%m-%d"))
  
  factor(
    ifelse (d >= WS | d < SE, SeasonStates$Winter,
      ifelse (d >= SE & d < SS, SeasonStates$Spring,
        ifelse (d >= SS & d < FE, SeasonStates$Summer, SeasonStates$Fall))),
    levels = SeasonStates
  )
}
dataBySeason <- dataByMonth %>%
  mutate(Season = season(Date))
dataBySeason %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Season)

The distributions under the season discrimination seem less Gaussian, but they encode the same pattern. Let's model both networks and see how well each fits the parameters.

First we will model the Month-based one:

dataLatentMonth <- dataByMonth %>%
  mutate(Temp = NA) %>%
  select(Month, Temp, TS1, TS2, TS3) %>%
  as.data.frame()
dataLatentMonth$Temp <- as.numeric(dataLatentMonth$Temp)

And now we model the network DAG:

heatAlarmDagMonth <- model2network("[Month][Temp|Month][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDagMonth)

Let's repeat the EM example and induce the parameters of the hidden variable with the previously successful configuration.

dataImputedMonth <- dataLatentMonth %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()
heatAlarmModelMonth <- parametric.em(heatAlarmDagMonth, dataLatentMonth, dataImputedMonth, iter = 10)
dataImputeMonthPosterior <- impute(heatAlarmModelMonth, data = dataLatentMonth, method = "bayes-lw")
dataImputeMonthPosterior %>%
  gather("Sensor", "Measure", Temp, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Month)

This is a very good fit indeed. We can see that Temp is almost identical to TS3, and that the other two sensors are just transformations of this variable. Let's check whether Season is a good simplification of the Month variable.

dataLatentSeason <- dataBySeason %>%
  mutate(Temp = NA) %>%
  select(Season, Temp, TS1, TS2, TS3) %>%
  as.data.frame()
dataLatentSeason$Temp <- as.numeric(dataLatentSeason$Temp)
heatAlarmDagSeason <- model2network("[Season][Temp|Season][TS1|Temp][TS2|Temp][TS3|Temp]")
dataImputedSeason <- dataLatentSeason %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()

With that configuration we run the EM:

heatAlarmModelSeason <- parametric.em(heatAlarmDagSeason, dataLatentSeason, dataImputedSeason, iter = 10)
dataImputeSeasonPosterior <- impute(heatAlarmModelSeason, data = dataLatentSeason, method = "bayes-lw")
dataImputeSeasonPosterior %>%
  gather("Sensor", "Measure", Temp, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Season)

Another good fit. Which of the two approaches is better? The initial guess should be the Season one, as it has fewer parameters than the Month model. Let's ask the BIC score.

print(BIC(heatAlarmModelSeason, dataImputeSeasonPosterior))
[1] -1004.384
print(BIC(heatAlarmModelMonth, dataImputeMonthPosterior))
[1] -4853.787

As expected, Season achieves a considerably better BIC score than Month and should be our pick.

Modeling anomalies

One of the most appealing aspects of Bayesian Networks is the ability to include expert knowledge in the model. While this is almost impossible in other frameworks, it is a natural concept here.

We call BNs open box models because we can understand and interpret the parameters. This is really useful for modelling anomalies, because we can use latent variables and expert knowledge to represent events that cannot be observed.

If the events cannot be observed, we have no data to learn from. In our example we would have to destroy machines and set the room on fire to capture data about anomalies, which would surely have a high cost. Imagine other scenarios such as rare diseases, autonomous driving or financial fraud.

We will start by modeling a global fault by adding a new variable, Alarm, with two states: yes and no. We have been collecting data from the sensors for a year, and we assume that the system has worked correctly and that no failures have been recorded (perhaps because the expert technicians that maintain the room have assured us so).

Given this, the Alarm variable will always be observed as no.

# New Variables
AlarmStates <- list(
  No  = "no",
  Yes = "yes"
)
dataLatent <- dataLatentSeason %>%
  mutate(Alarm = factor(AlarmStates$No, levels = AlarmStates)) %>%
  select(Alarm, Season, Temp, TS1, TS2, TS3) %>%
  as.data.frame()

And now the new dag:

heatAlarmDag <- model2network("[Alarm][Season][Temp|Season:Alarm][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag)

Let's run the EM algorithm one more time; there should be no changes to the parameters, but it will reinforce our logical pipeline.

dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()
heatAlarmModel <- parametric.em(heatAlarmDag, dataLatent, dataImputed, iter = 10)

Let's check the result on the Alarm variable:

heatAlarmModel$Alarm

  Parameters of node Alarm (multinomial distribution)

Conditional probability table:
  no yes 
  1   0 

Obviously the algorithm will fit all the cases to the observed population in which no alarms have occurred. Here is where the expert knowledge enters. We will contact our experts and design a proper parametrization of the variable.

“The system has a random failure rate of 1/1000; in such cases, the room temperature quickly rises by about 10 degrees.”

We have enough information to include this expert knowledge in the Alarm variable. Notice that this variable introduces additional parameters into the Temp variable that must be modelled too.

cptAlarm <- coef(heatAlarmModel$Alarm)
print(cptAlarm)
 no yes 
  1   0 
cptAlarm[] <- c(0.999, 0.001)
heatAlarmModel$Alarm <- cptAlarm
print(heatAlarmModel$Alarm)

  Parameters of node Alarm (multinomial distribution)

Conditional probability table:
    no   yes 
0.999 0.001 

Now for the Temp variable, which is a little more complex:

cgaussTemp <- coef(heatAlarmModel$Temp)
sdTemp     <- heatAlarmModel$Temp$sd
print(cgaussTemp)
                   0   1        2   3        4   5        6   7
(Intercept) 16.68242 NaN 21.16419 NaN 28.02095 NaN 21.01119 NaN
print(sdTemp)
       0        1        2        3        4        5        6        7 
3.428538      NaN 3.774607      NaN 2.444216      NaN 4.371588      NaN 
cgaussTemp[is.nan(cgaussTemp)] <- cgaussTemp[!is.nan(cgaussTemp)] + 10
sdTemp[is.nan(sdTemp)] <- sdTemp[!is.nan(sdTemp)]
heatAlarmModel$Temp <- list( coef = cgaussTemp, sd = sdTemp)
heatAlarmModel$Temp

  Parameters of node Temp (conditional Gaussian distribution)

Conditional density: Temp | Alarm + Season
Coefficients:
                    0         1         2         3         4         5         6         7
(Intercept)  16.68242  26.68242  21.16419  31.16419  28.02095  38.02095  21.01119  31.01119
Standard deviation of the residuals:
       0         1         2         3         4         5         6         7  
3.428538  3.428538  3.774607  3.774607  2.444216  2.444216  4.371588  4.371588  
Discrete parents' configurations:
   Alarm  Season
0     no  winter
1    yes  winter
2     no  spring
3    yes  spring
4     no  summer
5    yes  summer
6     no    fall
7    yes    fall

We now have a complete model with a modelled anomaly detection schema. Let's test some queries.

e <- list( "Season" = SeasonStates$Winter, "TS1" = 23, "TS2" = 33, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
[1] 0.3232407

This temperature is too high for winter, so the alarm is likely to fire.

e <- list( "Season" = SeasonStates$Summer, "TS1" = 23, "TS2" = 33, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
[1] 0

However, this is a good fit for the higher temperatures registered in summer.

More anomalies

We could model additional anomalies for the individual malfunctioning of each sensor. In that case we could observe either global temperature anomalies or individual malfunctions.

A common pattern is to add a new hidden discrete variable conditioning each sensor with a random distribution in the case of malfunction.

# New Variables
TS1FaultStates <- list(
  No  = "no",
  Yes = "yes"
)
dataLatent <- dataLatentSeason %>%
  mutate(
    Alarm = factor(AlarmStates$No, levels = AlarmStates),
    TS1Fault = factor(TS1FaultStates$No, levels = TS1FaultStates)
  ) %>%
  select(Alarm, Season, Temp, TS1Fault, TS1, TS2, TS3) %>%
  as.data.frame()
dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()
heatAlarmDag <- model2network("[Alarm][Season][TS1Fault][Temp|Season:Alarm][TS1|Temp:TS1Fault][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag)

We launch EM and modify the probabilities of the new nodes (now Alarm and TS1Fault). For that we need additional expert knowledge from the domain. We ask our IoT team about the fault tolerance of our sensors. They tell us the following:

Our sensors are the best; they usually do not fail. The testing team estimates just a 1/1000 failure rate.

This is good: with such a marginal error probability our new node will only detect anomalies, without introducing noise in the remaining propagation within the model.

heatAlarmModel <- parametric.em(heatAlarmDag, dataLatent, dataImputed, iter = 10)
# TODO: Please don't do this, this is not DRY...
cptAlarm[] <- c(0.999, 0.001)
heatAlarmModel$Alarm <- cptAlarm
cgaussTemp <- coef(heatAlarmModel$Temp)
sdTemp     <- heatAlarmModel$Temp$sd
cgaussTemp[is.nan(cgaussTemp)] <- cgaussTemp[!is.nan(cgaussTemp)] + 10
sdTemp[is.nan(sdTemp)] <- sdTemp[!is.nan(sdTemp)]
heatAlarmModel$Temp <- list( coef = cgaussTemp, sd = sdTemp)
cptTS1Fault <- coef(heatAlarmModel$TS1Fault)
print(cptTS1Fault)
 no yes 
  1   0 
cptTS1Fault[] <- c(0.9999, 0.0001)
heatAlarmModel$TS1Fault <- cptTS1Fault
print(heatAlarmModel$TS1Fault)

  Parameters of node TS1Fault (multinomial distribution)

Conditional probability table:
     no    yes 
0.9999 0.0001 

And now for the TS1 node, in which we have to add a random distribution. We can use a Gaussian with lots of variance.

cgaussTS1 <- coef(heatAlarmModel$TS1)
sdTS1     <- heatAlarmModel$TS1$sd
print(cgaussTS1)
                     0   1
(Intercept) -3.7579690 NaN
Temp         0.9554187 NaN
print(sdTS1)
        0         1 
0.1100198       NaN 
cgaussTS1[is.nan(cgaussTS1)] <- 0
sdTS1[is.nan(sdTS1)] <- 1000000
heatAlarmModel$TS1 <- list( coef = cgaussTS1, sd = sdTS1)
heatAlarmModel$TS1

  Parameters of node TS1 (conditional Gaussian distribution)

Conditional density: TS1 | Temp + TS1Fault
Coefficients:
                      0           1
(Intercept)  -3.7579690   0.0000000
Temp          0.9554187   0.0000000
Standard deviation of the residuals:
           0             1  
1.100198e-01  1.000000e+06  
Discrete parents' configurations:
   TS1Fault
0        no
1       yes

Lets test some inference cases in which the TS1 sensor is malfunctioning or reporting some kind of anomalous measures.

If TS1 is the only sensor capturing a very high temperature, the fault node will detect the anomaly:

e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 24, "TS3" = 20 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
[1] 0.1508417
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 24, "TS3" = 20 )
query <- cpquery(heatAlarmModel, event = TS1Fault == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
[1] 0.4746715

If all three sensors capture the same anomaly, the model will attribute the problem to the global Alarm rather than to an individual sensor fault.

e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 37, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
[1] 0.8
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 37, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = TS1Fault == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
[1] 0
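The intuition behind the fault posterior can be reproduced in base R with a two-component calculation: given the density of TS1 under normal operation, the near-flat density under a fault, and the fault prior, Bayes' rule yields the posterior probability of a fault. The numbers below are illustrative (an assumed latent temperature and rounded coefficients), not the model's exact parameters.

```r
# Two-component fault posterior via Bayes' rule (illustrative numbers)
prior_fault <- 0.0001
temp <- 21                     # assumed latent room temperature
ts1  <- 35                     # anomalous reading from TS1
# Density under normal operation: TS1 ~ N(-3.76 + 0.96 * Temp, sd = 0.11)
lik_ok    <- dnorm(ts1, mean = -3.76 + 0.96 * temp, sd = 0.11)
# Density under a fault: near-flat Gaussian with a huge variance
lik_fault <- dnorm(ts1, mean = 0, sd = 1e6)
posterior <- prior_fault * lik_fault /
  (prior_fault * lik_fault + (1 - prior_fault) * lik_ok)
posterior  # essentially 1: the reading is far likelier under the fault component
```

In the full network the latent temperature is itself uncertain, which is why the posteriors above are less extreme than in this toy calculation.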

A more complex problem

This is a variation on the previous problem. Can you spot the differences introduced by this new version?

dataRaw <- read_csv("./data/heatAlarm-lvl2.csv")
Parsed with column specification:
cols(
  Date = col_date(format = ""),
  Hour = col_character(),
  TSO = col_double(),
  TS1 = col_double(),
  TS2 = col_double(),
  TS3 = col_double()
)

It seems that the measures are somehow dependent, but the relationship is not as clear as in the previous phase.

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_line()

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_smooth()

Looking at the context

Our experts tell us that the indoor temperature of the room is affected by the outdoor weather, especially near the windows and doors. The temperature in the interior part of the room is more stable.

Given this new knowledge we decide to place a new sensor outdoors, TSO. Now we have an additional component to add to our model; hopefully it will let us model the mixture of indoor and outdoor temperatures.

dataRaw %>%
  gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_smooth()

dataByMonth <- dataRaw %>%
  mutate(Month = as.factor(month(Date, label = TRUE, abbr = TRUE)))
dataByMonth %>%
  gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Month)

dataBySeason <- dataByMonth %>%
  mutate(Season = season(Date))
dataBySeason %>%
  gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.4) +
    facet_wrap(~Season)

Adding new components to the model

To add the new sensor to the model we are going to need an additional latent variable. Our model will now have two latent variables for the temperature, TInd and TOut (indoor and outdoor). The indoor temperature will depend on the outdoor one. Let's build it:

heatAlarmDag2 <- model2network("[Alarm][Season][TOut|Season][TSO|TOut][TInd|TOut:Alarm][TS1|TInd][TS2|TInd][TS3|TInd]")
plotD3bn(heatAlarmDag2)

This model is the base example of the new domain. However, we may have some doubts regarding the relationship between the TOut variable and the sensors: look closely at the distributions and you will see that not all of them are affected equally by the outdoor temperature. We ask again…

IoT team: Uh, yes. We installed TS1 in a corner of the room, near the entrance. TS2 is on the ceiling and might be near a ventilation shaft, so it will probably capture a bit more of the heat from the machines. TS3 should be more stable, as it is located in the middle of the room.

This almost makes sense visually. However, at this point we will need to test the different structures, either by learning all of them and comparing them with BIC, or by running a structural learning process.
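As a sketch of the score-comparison route, here is how two candidate structures can be ranked with `score` on bnlearn's built-in `gaussian.test` data (a stand-in for our sensor data; in bnlearn, higher scores are better):

```r
library(bnlearn)
data(gaussian.test)
# Two candidate structures over the same variables
dag1 <- model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
dag2 <- model2network("[A][B][E][G][C|A][D|B][F|A:D:E:G]")  # drops B -> C
score(dag1, data = gaussian.test, type = "bic-g")
score(dag2, data = gaussian.test, type = "bic-g")
```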



License

This tutorial is licensed under the GPL3 license and is originally published and maintained in this GitHub repository by Jacinto Arias.

---
title: "Introductory tutorial on Bayesian networks in R"
author: "Jacinto Arias"
date: "Last Update October 20, 2017"
output:
  html_notebook:
    toc: yes
    toc_depth: 2
    toc_float: yes
---


## Data Structures

The usage of bnlearn revolves around two main **data structures** that represent a Bayesian Network in different stages (NOTE that these are *S3* classes and their names might overlap with other functions):

* `bn` [[Docs]](http://www.bnlearn.com/documentation/man/bn.class.html). Represents the structural information, variables, graph and learning algorithm if provided.
* `bn.fit` [[Docs]](http://www.bnlearn.com/documentation/man/bn.fit.class.html). Adds the parametric information on top of the previous structure. Contains the distribution of each node according to its type and parent configuration.


## Creating the structure of Bayesian networks

There are different ways to manually initialize and modify the graph of a Bayesian Network.

We can create empty graphs from a set of variables:

```{r}
vars <- LETTERS[1:6]
dag  <- empty.graph(vars)

dag
```

You can specify arcs sets in various ways. The most common are:

As a two column (`from`, `to`) matrix, setting them via `arcs` to an existing network:

```{r}
e <- matrix(
      c("A", "C", "B", "F", "C", "F"),
      ncol = 2, byrow = TRUE,
      dimnames = list(NULL, c("from", "to"))
    )

arcs(dag) <- e

dag
```

We can also use an adjacency matrix, and assign it to a dag with `amat`:

```{r}
adj <- matrix(
        0L, 
        ncol = 6, 
        nrow = 6,
        dimnames = list(vars, vars)
       )

adj["A", "C"] = 1L
adj["B", "F"] = 1L
adj["C", "F"] = 1L
adj["D", "E"] = 1L
adj["A", "E"] = 1L
print(adj)

amat(dag) <- adj

dag
```

The last option is to create a *formula* for a given set of variables. Each node is specified by a pair of brackets `[<var_name>]`. If the node has a parent set we denote it by `|` and specify the list of parents separated by colons `:`. We can compute the formula into a `bn` object with `model2network`.

```{r}
dag <- model2network("[A][C][B|A][D|C][F|A:B:C][E|F]")
dag
```

The package provides other capabilities such as **random graphs** and **graph sampling**. In addition there is a range of utilities to manipulate the graphs [[Link]](http://www.bnlearn.com/documentation/man/mb.html) [[Link]](http://www.bnlearn.com/documentation/man/graph.html).
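For instance, here is a brief sketch of two of these utilities, generating a random DAG and querying a Markov blanket:

```{r}
# Generate a random DAG over six variables and inspect a Markov blanket
rand <- random.graph(LETTERS[1:6])
mb(rand, node = "A")
```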


## Plotting graphs

We can plot graphs with the built-in R engine by calling `plot` on a `bn` object:

```{r}
plot(dag)
```

Minimal aspects of the plot can be customized as documented in the corresponding help page. Other packages can be used indirectly to plot graphs; `bnlearn` provides connections with some of them, but be aware that they might be outdated.

Fortunately, graphs are a common data structure and we can find lots of utilities to work with them. One of the most popular visualization tools for graphs, and for data in general, is the D3 library from the *JavaScript* world. Thanks to the integration of **R Shiny** with web technologies we can find wonderful ports such as the `networkD3` package.

The next snippet is just a custom function to transform a `bn` object to the required information to plot a *D3 force graph*.

*Please install* the `networkD3` package; be careful, as this package is currently in beta and future versions could break this code.

```{r}
plotD3bn <- function(bn) {
  varNames <- nodes(bn)

  # Nodes should be zero indexed!
  links <- data.frame(arcs(bn)) %>%
    mutate(from = match(from, varNames)-1, to = match(to, varNames)-1, value = 1)
  
  nodes <- data.frame(name = varNames) %>%
    mutate(group = 1, size = 30)
  
  networkD3::forceNetwork(
    Links = links,  
    Nodes = nodes,
    Source = "from",
    Target = "to",
    Value = "value",
    NodeID = "name",
    Group = "group",
    fontSize = 20,
    zoom = TRUE,
    arrows = TRUE,
    bounded = TRUE,
    opacityNoHover = 1
  )
}
```


We can now plot our last generated BN. If the result looks too small you can zoom in and out using your *mouse wheel*.


```{r}
plotD3bn(dag)
```


## Loading Bayesian networks from files

There are different file formats to represent a Bayesian network. They have originated over the years as efforts to create standards or as part of particular proprietary systems. Despite these efforts there is no agreed-upon closed standard; perhaps the closest one is the `xbif` format [[Link]](http://www.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/).

`bnlearn` provides several ways to load BNs from different formats [[Docs]](http://www.bnlearn.com/documentation/man/foreign.html). You should use them when importing from or exporting to other systems, or to share networks with other people; if you plan to use only bnlearn you can simply save an `rda` file.
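As a quick sketch of these helpers, we can round-trip a tiny hypothetical discrete network through the BIF format (BIF only supports discrete variables; `custom.fit` is covered later in this tutorial):

```{r}
# Round-trip a tiny discrete network through the BIF format
net  <- model2network("[A][B|A]")
cptA <- matrix(c(0.3, 0.7), ncol = 2, dimnames = list(NULL, c("a0", "a1")))
cptB <- matrix(c(0.8, 0.2, 0.1, 0.9), ncol = 2,
               dimnames = list(B = c("b0", "b1"), A = c("a0", "a1")))
fit  <- custom.fit(net, dist = list(A = cptA, B = cptB))
write.bif("tiny.bif", fit)        # export
reloaded <- read.bif("tiny.bif")  # import back as a bn.fit object
reloaded$B
```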


## The Bayes net repository

The maintainers of `bnlearn` also provide a modern R-focused repository for a series of popular Bayesian networks that have been used extensively for benchmarking in the literature.

http://www.bnlearn.com/bnrepository/

Here you can find networks with different properties that can be used to test algorithms or to explore this and other BN packages. We will now work with the ever-popular `asia` network: *Asia* is for BNs what *"iris"* is for statistics.

```{r}
# This downloads the RData file from the repository and loads it.
# The bn is loaded into a bn.fit variable called bn
load(url("http://www.bnlearn.com/bnrepository/asia/asia.rda"))
asia <- bn
```


## Discrete networks

We will now explore the contents of this variable. As you can see, it contains not only the structure but the parameters as well; first we will take a look at the structure:

```{r}
bn.net(asia)
```

```{r}
plotD3bn(asia)
```

Now it is time to review the parameters; this prints each node and the associated probability table. In this case all variables are **discrete**, so the tables are conditional probability tables.

```{r}
asia
```

We can access individual nodes of the net as in a data.frame:

```{r}
asia$smoke
```

There is also a function to plot the distributions of discrete networks:

```{r}
bn.fit.barchart(asia$smoke)
```

```{r}
bn.fit.barchart(asia$dysp)
```


## Introducing expert knowledge

We can alter the probability tables of a BN; this is useful to override parameters learnt from data or to set the distributions of unobserved variables. This mechanism allows us to include expert information from the domain of the modelled problem.

To modify a **conditional probability table** you can just directly replace the existing table in the model by extracting it with `coef`. Be careful to maintain the inherent restrictions of the probability distribution.

```{r}
cpt <- coef(asia$smoke)
cpt[] <- c(0.2, 0.8)
asia$smoke <- cpt
asia$smoke
```


## Sampling data from a Bayesian network

`bnlearn` introduces `rbn`, an *R-like* function to sample data from a given fitted model. We will now sample a dataset from *asia* to test learning from data.

```{r}
# Note that the order of the parameters is inverted from the R sample functions
sampleAsia <- rbn(x = asia, n = 10000)
head(sampleAsia)
```


## Parametric learning from data

We can induce the parameters of a Bayesian Network from observed data. `bnlearn` provides different algorithms for this, with Maximum Likelihood Estimation (MLE) being the most common.

We can invoke the learning algorithm by using the function `bn.fit`. For that we need a DAG and a compatible dataset:

```{r}
net <- bn.net(asia)
asiaInduced <- bn.fit(x = net, data = sampleAsia)
```

We can now compare the two networks; there should be some discrepancies in the induced one. Notice that extremely marginal probabilities will rarely be simulated and thus will not have a significant presence in the sample.

```{r}
asiaInduced
```
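Besides MLE, `bnlearn` can also compute Bayesian posterior estimates for discrete networks via `method = "bayes"`, where `iss` is the imaginary sample size of the uniform prior. A minimal sketch on the built-in `learning.test` data:

```{r}
# Bayesian parameter estimation with a uniform prior (iss = imaginary sample size)
data(learning.test)
dagL <- model2network("[A][C][F][B|A][D|A:C][E|B:F]")
fitBayes <- bn.fit(dagL, learning.test, method = "bayes", iss = 10)
fitBayes$A
```

With a Bayesian estimate no cell of a probability table can be exactly zero, which avoids the missing-configurations problem of MLE on small samples.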

## Structural learning

On many occasions the structure of the model is designed by hand. We usually know the variables of the problem and the different causal patterns from expert knowledge, and this descriptive information can be transferred to the model. This is what we call an open box model, and it provides a powerful framework for many problems, especially when compared to other models that do not provide a clear interpretation of their parameters.

However, there are many situations in which we would like to automatize the structural learning of a model such as causal patterns discovery or a lack of knowledge of the domain.

In other cases we just want to learn about particular dependency relationships between the variables or select the best structure among a particular set.

`bnlearn` specializes in structural learning, and for this reason it provides a range of algorithms to learn the models. There is a complex taxonomy of such algorithms related to the statistical tests, metrics and heuristics in which they are based. Exact learning is a NP-hard problem and thus several approaches have been proposed.

We will briefly study the `hc` algorithm to learn a full structure and the `BIC` score metric to measure the fit of a particular network with a dataset.

The `hc` algorithm can be simply run from a sample of data, although many options can be set to optimize it for a particular problem.

```{r}
networkInduced <- hc(x = sampleAsia)
networkInduced
```
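`hc` also accepts whitelists and blacklists of arcs, which lets us inject prior knowledge into the search; a sketch:

```{r}
# Force the arc smoke -> lung and forbid the reverse during the search
wl <- data.frame(from = "smoke", to = "lung")
bl <- data.frame(from = "lung",  to = "smoke")
networkConstrained <- hc(x = sampleAsia, whitelist = wl, blacklist = bl)
modelstring(networkConstrained)
```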

Let's compare it with the original **golden model** network, as the algorithm may have introduced some differences given that we used a small data sample.

```{r}
modelstring(asia)
```

We can also compute some metrics to compare the network. The structural Hamming distance determines the amount of discrepancy between the two graphs.

```{r}
shd(bn.net(asia), networkInduced)
```

The BIC score measures the fit of the structure for a given sample, penalizing the number of parameters to avoid overfitting. Note that `bnlearn` rescales the score so that higher values indicate a better fit. In this case the results are almost identical; it seems that the induced model could be slightly biased towards the sample, as it marginally outperforms the golden model.

```{r}
print(BIC(object = asia, data = sampleAsia))
print(BIC(object = networkInduced, data = sampleAsia))
```
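Since BIC trades likelihood against complexity, it can also help to inspect the number of parameters of each model directly with `nparams` (an unfitted structure needs the data to size its tables):

```{r}
print(nparams(asia))
print(nparams(networkInduced, data = sampleAsia))
```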


## Gaussian networks

Gaussian networks differ in the type of local distributions that represent them. If all nodes are Gaussian we will find root Gaussian nodes and conditional linear Gaussian nodes. Root Gaussian nodes are encoded by the normal distribution parameters (mean and sd); nodes with parents are represented as a linear regression with a coefficient for each parent, an intercept term and the standard deviation of the residuals.

`bnlearn` has some sample gaussian data to test these BNs:

```{r}
data(gaussian.test)
dag = model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]")
model <- bn.fit(dag, gaussian.test)
model
```

To modify a Gaussian network node we proceed as in the discrete case:

```{r}
model$A <- list(coef = c("(Intercept)" = 10), sd = 0)
model$A

```


## Hybrid networks

A hybrid network contains both discrete and continuous variables. **There is a framework restriction: a discrete variable cannot have any continuous parent**. The parameters are modelled as in the previous cases.

A continuous variable with discrete parents is represented by a conditional Gaussian distribution, where there is a linear gaussian distribution (according to any continuous parents) for each configuration of the discrete parents.

In the next example we will use `custom.fit` to manually load the parameters into a graph:

```{r}
net <- model2network("[A][B][C|A:B]")

cptA  <- matrix(c(0.4, 0.6), ncol = 2, dimnames = list(NULL, c("LOW", "HIGH")))
distB <- list(coef = c("(Intercept)" = 1), sd = 1.5)
distC <- list(
  coef = 
    matrix(
      c(1.2, 2.3, 3.4, 4.5), 
      ncol = 2,
      dimnames = list(c("(Intercept)", "B"), NULL)
    ),
    sd = c(0.3, 0.6)
)

model = custom.fit(net, dist = list(A = cptA, B = distB, C = distC))
model
```
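To see the conditional Gaussian behaviour in action we can sample from this hand-built model; the regression of C on B should switch with the state of A (a quick sketch):

```{r}
# Sample from the hybrid model and compare C across the states of A
s <- rbn(model, n = 1000)
aggregate(C ~ A, data = s, FUN = mean)
```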

## Inference and probability queries

The inference engine of `bnlearn` is limited, but it can be used to test the networks and to perform basic operations with the models.

By using `cpquery` we can ask for the probability of an **event** given a set of evidence, both of them are boolean expressions involving the variables in the model. We may ask for a particular combination of configurations in the BN and a set of observed statuses for the variables in the evidence.

For example, in the asia network we could ask: *what is the probability of a positive lung cancer diagnosis for a person that smokes?*

(For `cpquery` I recommend the more powerful `lw` (likelihood weighting) algorithm.)

```{r}
# First we should observe the prior probability to compare
# TRUE is for empty evidence
cpquery(asia, event = lung == "yes", evidence = TRUE)
```
```{r}
cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000)
```

Notice that the method is not very stable, so it is better to average over repeated queries (note that `rep` would just copy a single result; `replicate` re-runs the query):

```{r}
# Repeat each query and average the results for a more stable estimate
queries <- replicate(100, cpquery(asia, event = lung == "yes", evidence = TRUE))
print(mean(queries))

queries <- replicate(100, cpquery(asia, event = lung == "yes", evidence = list(smoke = "yes"), method = "lw", n = 100000))
print(mean(queries))
```

The other option is to use `cpdist` to sample cases from the network for a set of nodes given some evidence. The usage is the same, and we can obtain more stability by increasing the size of the sample.

```{r}
s <- cpdist(asia, nodes = c("lung"), evidence = TRUE, n=1000)
head(s)
```

We can also summarise or plot the resulting distribution:

```{r}
summary(s)
```

```{r}
prop.table(table(s))
```


```{r}
ggplot(s, aes(x=lung)) + geom_bar()
```

---

# Modeling with Bayesian Networks

Now that we have a nice overview of the operations available in `bnlearn`, we will switch to an applied problem in which we want to build a model for a real context.

## The problem

We are part of a Data Science team in an IoT company that has developed a new *Temperature* sensor to monitor data center rooms in order to detect and predict refrigeration failures early. The sensors were installed a year ago, and we need to create a model able to detect specific heat outbreaks or global heating problems in the room.

We are provided with a dataset consisting of a number of measures per day during the previous year for a total of *3 sensors* that have been placed at strategic locations across the room.

We will start to work on the first version of the problem `heatAlarm-lvl1.csv`.

```{r}
dataRaw <- read_csv("./data/heatAlarm-lvl1.csv")
head(dataRaw)
```

## Exploratory analysis

We will start by studying the properties of our problem at hand. First we should look into the distribution of the dataset.

```{r}
summary(dataRaw)
```

We see that we have a complete year of measures and different ranges for each sensor. Now it is interesting to check the completeness of the data; we would also like to see its distribution along dates.

```{r}
dataRaw %>%
  group_by(Date) %>%
  summarise(Measures = n()) %>%
  ggplot(aes(x=Measures)) + geom_histogram()
```

We confirm that the number of measures per day is stable. We will now observe the distribution of the Temperature measures for each sensor.

```{r}
dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_line()

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_smooth()

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_boxplot()
```

We observe a different distribution for each sensor. Visually there appears to be some linear variation between them, in fact a simple additive relationship that could be modelled with a scalar intercept. This **dependency** relationship can be modelled in a BN, but what is the best way to do it?
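Before committing to a model, we can run a quick numeric check of the additive hypothesis (a sketch, assuming the `dataRaw` tibble loaded above): if the sensors differ by a scalar offset, their pairwise differences should be roughly constant.

```{r}
# If TS2 and TS3 are scalar shifts of TS1, the differences should be
# near-constant: small standard deviations relative to the mean offsets
dataRaw %>%
  summarise(
    offset21 = mean(TS2 - TS1), sd21 = sd(TS2 - TS1),
    offset31 = mean(TS3 - TS1), sd31 = sd(TS3 - TS1)
  )
```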

## Latent variables

The solution is elegant from both the computational and the statistical points of view. We can think of the sensors as **measuring** some real but unobservable temperature from the environment. This hidden value conditions the exact measures that the sensors capture, and thus can model their individual distributions.

In other words, our model requires an additional variable, more concretely a latent **Temp** variable that cannot be observed but is captured by each sensor following a particular dependency. We will represent this by adding a new variable to our data (we will also remove unused variables).

```{r}
dataLatent <- dataRaw %>% 
  mutate(Temp = NA) %>%
  select(Temp, TS1, TS2, TS3) %>%
  as.data.frame()

dataLatent$Temp <- as.numeric(dataLatent$Temp)
```

The next step is to estimate this variable to complete our model. We can learn the three sensor distributions from the data, but not the Temp variable. For that we will use the `EM` algorithm. First, let's build the model:

```{r}
heatAlarmDag1 <- model2network("[Temp][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag1)
```


Now for the EM. Unfortunately `bnlearn` does not provide a standalone implementation of this algorithm (it does for structural learning EM, which is a superset), but it provides the tools to do so. Basically, the parametric EM algorithm can be implemented by:

* **E step:** Impute the missing variables in the data using Bayesian estimation from the current model.
* **M step:** Learn new parameters for the model by maximizing the likelihood of the imputed data (MLE).

As you can imagine, these steps are repeated for a number of iterations or until convergence. To impute the latent variables we will use the function `impute`; **however**, we need an initial model to do so. This is another important element of the EM algorithm, as we need a good initial guess to ensure a proper fit.

Let's first try with a random initialization of Temp:

```{r}
# Lets add uniform Temp 
dataImputed <- dataLatent %>% rowwise() %>% mutate(Temp = runif(1, 10, 25))

dataImputed %>%
  ggplot(aes(x=Temp)) + geom_density()
```

The EM algorithm is then quite simple:

```{r}
parametric.em <- function(dag, dataLatent, dataImputed, iter = 5) {
  fitted <- bn.fit(dag, dataImputed, method = "mle")
  for ( i in seq(iter)) {
   complete <- impute(fitted, data = dataLatent, method = "bayes-lw")
   fitted <- bn.fit(dag, complete, method = "mle")
  }
  fitted
}
```

Let's try it

```{r}
heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
```

Let's see the resulting model:

```{r}
heatAlarmModel1

# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()
```

Look at those intercept coefficients, they do not look right. As we can see in the plot, it seems like a forced fit where the Temp variable has a very different distribution. Let's try another guess at the prior for Temp; this time we will use a Gaussian distribution with the statistics drawn from the sensors.


```{r}

statistics <- dataLatent %>% 
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  summarise( 
    mu    = mean(Measure),
    sigma = sd(Measure)
  )

dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = rnorm(1, statistics$mu, statistics$sigma))

dataImputed %>%
  ggplot(aes(x=Temp)) + geom_density()
```

Let's repeat the EM:

```{r}
heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
```

```{r}
heatAlarmModel1

# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()
```

This time the fit is even worse, as we overfitted a bad prior. We will try to make a better guess, this time dependent on the actual values of the sensors.



```{r}
dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3)))

dataImputed %>%
  ggplot(aes(x=Temp)) + geom_density()
```

```{r}
heatAlarmModel1 <- parametric.em(heatAlarmDag1, dataLatent, dataImputed, iter = 10)
```

```{r}
heatAlarmModel1

# Sample some data and plot it
impute(heatAlarmModel1, data = dataLatent, method = "bayes-lw") %>%
gather("Sensor", "Measure") %>%
ggplot(aes(x=Measure, color=Sensor)) + geom_density()
```

Now it seems that we got a nice fit: the density plot shows the sensors as scalar transformations of the latent Temp variable.
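We can also inspect the fitted coefficients directly (using the `heatAlarmModel1` object from above); if this interpretation is right, each sensor node should show a slope on `Temp` close to 1 and an intercept encoding its individual offset.

```{r}
# Each sensor's local distribution: intercept + slope * Temp
coef(heatAlarmModel1$TS1)
coef(heatAlarmModel1$TS2)
coef(heatAlarmModel1$TS3)
```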


## Modeling the domain

If we go back to the plot showing the evolution of the temperature along the year, we can see that there could be different distributions for different periods of time. If we can obtain this discriminative information we can do a better job modeling the domain.

We will start by discriminating the distribution by month:

```{r}
dataByMonth <- dataRaw %>%
  mutate(Month = as.factor(month(Date, label = TRUE, abbr = TRUE)))

dataByMonth %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Month)
```

It really seems that the month captures additional information about the distribution. In this case we would obtain a better fit to the data by conditioning Temp on Month as a discrete parent. However, we would then need parameters for 12 different Gaussian distributions, which may be overkill.

Let's see if we can summarise the month information by creating a lower dimensional variable. If we think about the weather domain we quickly come up with the year's seasons. Let's try to discriminate the distributions by season.

```{r}
# lubridate can't extract seasons, so we need a custom function

# We encode the states as a named list to avoid magic strings
SeasonStates <- list(
  Winter = "winter",
  Spring = "spring",
  Summer = "summer",
  Fall = "fall"
)

season <- function(date){
  WS <- as.Date("2012-12-15", format = "%Y-%m-%d") # Winter Solstice
  SE <- as.Date("2012-3-15",  format = "%Y-%m-%d") # Spring Equinox
  SS <- as.Date("2012-6-15",  format = "%Y-%m-%d") # Summer Solstice
  FE <- as.Date("2012-9-15",  format = "%Y-%m-%d") # Fall Equinox
  
  # Convert dates from any year to 2012 dates
  d <- as.Date(strftime(date, format="2012-%m-%d"))
  
  factor(
    ifelse (d >= WS | d < SE, SeasonStates$Winter,
      ifelse (d >= SE & d < SS, SeasonStates$Spring,
        ifelse (d >= SS & d < FE, SeasonStates$Summer, SeasonStates$Fall))),
    levels = SeasonStates
  )
}

dataBySeason <- dataByMonth %>%
  mutate(Season = season(Date))

dataBySeason %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Season)
```

The distributions under the season discrimination seem less Gaussian, but they encode the same pattern. Let's model both networks and see how well they fit the parameters.

First we will model the Month based one:

```{r}
dataLatentMonth <- dataByMonth %>%
  mutate(Temp = NA) %>%
  select(Month, Temp, TS1, TS2, TS3) %>%
  as.data.frame()

dataLatentMonth$Temp <- as.numeric(dataLatentMonth$Temp)
```

And now we model the network DAG

```{r}
heatAlarmDagMonth <- model2network("[Month][Temp|Month][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDagMonth)
```

Let's repeat the EM example and induce the parameters of the hidden variable with the previously successful configuration.

```{r}
dataImputedMonth <- dataLatentMonth %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()
```

```{r}
heatAlarmModelMonth <- parametric.em(heatAlarmDagMonth, dataLatentMonth, dataImputedMonth, iter = 10)
```

```{r}
dataImputeMonthPosterior <- impute(heatAlarmModelMonth, data = dataLatentMonth, method = "bayes-lw")

dataImputeMonthPosterior %>%
  gather("Sensor", "Measure", Temp, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Month)
```

This is a very good fit indeed. We can see that Temp is almost identical to TS3, and that the other two sensors are just transformations of this variable. Let's check whether the Season is a good simplification of the Month variable.

```{r}
dataLatentSeason <- dataBySeason %>%
  mutate(Temp = NA) %>%
  select(Season, Temp, TS1, TS2, TS3) %>%
  as.data.frame()

dataLatentSeason$Temp <- as.numeric(dataLatentSeason$Temp)

heatAlarmDagSeason <- model2network("[Season][Temp|Season][TS1|Temp][TS2|Temp][TS3|Temp]")

dataImputedSeason <- dataLatentSeason %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()
```

With that configuration we run the EM

```{r}
heatAlarmModelSeason <- parametric.em(heatAlarmDagSeason, dataLatentSeason, dataImputedSeason, iter = 10)
```

```{r}
dataImputeSeasonPosterior <- impute(heatAlarmModelSeason, data = dataLatentSeason, method = "bayes-lw")

dataImputeSeasonPosterior %>%
  gather("Sensor", "Measure", Temp, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Season)
```

Another good fit. Which one of the two approaches is better? Our initial guess should be the Season one, as it has fewer parameters than the Month model. Let's ask the BIC score.

```{r}
print(BIC(heatAlarmModelSeason, dataImputeSeasonPosterior))
print(BIC(heatAlarmModelMonth, dataImputeMonthPosterior))
```
As expected, the Season model obtains a clearly better BIC score than the Month model and should be our pick.
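The gap in the penalty term can be verified by counting parameters with `nparams` (a quick check on the fitted models from above): the conditional distribution of Temp needs one (mean, sd) pair per configuration of its discrete parent, so 12 pairs for Month against 4 for Season.

```{r}
# Parameter counts behind the BIC penalty
nparams(heatAlarmModelMonth)
nparams(heatAlarmModelSeason)
```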


## Modeling anomalies

One of the most appealing aspects of Bayesian Networks is the ability to include expert knowledge in the model. While this is almost impossible in other frameworks, it is a natural concept in this case.

We call BNs open box models because we can understand and interpret the parameters. This is really useful for modelling anomalies, because we can use latent variables and expert knowledge to represent events that cannot be observed.

If the events cannot be observed, we have no data to learn from. In our example we would have to destroy machines and set the room on fire to capture data about anomalies, and that would surely have a high cost. Imagine other scenarios such as rare diseases, autonomous driving or financial fraud.

We will start by modeling a global fault by adding a new variable `Alarm` with two states, `yes` and `no`. In this example we have been collecting data from the sensors for a year; we assume that the system has functioned correctly and that there have been no recorded failures (perhaps because the expert technicians that maintain the room have assured us so).

Given this, the Alarm variable will always be observed as `no`.

```{r}
# New Variables
AlarmStates <- list(
  No  = "no",
  Yes = "yes"
)

dataLatent <- dataLatentSeason %>%
  mutate(Alarm = factor(AlarmStates$No, levels = AlarmStates)) %>%
  select(Alarm, Season, Temp, TS1, TS2, TS3) %>%
  as.data.frame()
```

And now the new dag:

```{r}
heatAlarmDag <- model2network("[Alarm][Season][Temp|Season:Alarm][TS1|Temp][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag)
```

Let's run the EM algorithm one more time; there should be no changes to the parameters, but it will reinforce our logical pipeline.

```{r}
dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()
```

```{r}
heatAlarmModel <- parametric.em(heatAlarmDag, dataLatent, dataImputed, iter = 10)
```

Let's check the result on the Alarm variable:

```{r}
heatAlarmModel$Alarm
```

Obviously the algorithm will fit all the cases to the observed population in which no alarms have occurred. Here is where the **expert knowledge** enters. We will contact our experts and design a proper parametrization of the variable.

> "The system has a random failure rate of 1/1000; in such cases, the room temperature quickly rises by about 10 degrees".

We have enough information to include this expert knowledge into the Alarm variable. Notice that this variable introduces additional parameters into the Temp variable that should be modelled too.

```{r}
cptAlarm <- coef(heatAlarmModel$Alarm)
print(cptAlarm)
```

```{r}
cptAlarm[] <- c(0.999, 0.001)
heatAlarmModel$Alarm <- cptAlarm
print(heatAlarmModel$Alarm)
```

Now for the Temp variable, which is a little more complex:

```{r}
cgaussTemp <- coef(heatAlarmModel$Temp)
sdTemp     <- heatAlarmModel$Temp$sd
print(cgaussTemp)
print(sdTemp)
```

```{r}
cgaussTemp[is.nan(cgaussTemp)] <- cgaussTemp[!is.nan(cgaussTemp)] + 10
sdTemp[is.nan(sdTemp)] <- sdTemp[!is.nan(sdTemp)]

heatAlarmModel$Temp <- list( coef = cgaussTemp, sd = sdTemp)
heatAlarmModel$Temp
```

We have a complete model now, with a modelled anomaly detection scheme. Let's test some queries.


```{r}
e <- list( "Season" = SeasonStates$Winter, "TS1" = 23, "TS2" = 33, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
```

This temperature is too high for winter, so the alarm is likely to fire.

```{r}
e <- list( "Season" = SeasonStates$Summer, "TS1" = 23, "TS2" = 33, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
```

However, this is a good fit for the higher temperatures registered in summer.


## More anomalies

We could model additional anomalies for the individual malfunctioning of each sensor. In that case we could observe either global temperature anomalies or individual malfunctions.

A common pattern is to add a new hidden discrete variable conditioning each sensor with a random distribution in the case of malfunction.

```{r}
# New Variables
TS1FaultStates <- list(
  No  = "no",
  Yes = "yes"
)

dataLatent <- dataLatentSeason %>%
  mutate(
    Alarm = factor(AlarmStates$No, levels = AlarmStates),
    TS1Fault = factor(TS1FaultStates$No, levels = TS1FaultStates)
  ) %>%
  select(Alarm, Season, Temp, TS1Fault, TS1, TS2, TS3) %>%
  as.data.frame()

dataImputed <- dataLatent %>% 
  rowwise() %>% 
  mutate(Temp = mean(c(TS1, TS2, TS3))) %>%
  as.data.frame()

heatAlarmDag <- model2network("[Alarm][Season][TS1Fault][Temp|Season:Alarm][TS1|Temp:TS1Fault][TS2|Temp][TS3|Temp]")
plotD3bn(heatAlarmDag)
```

We launch EM and modify the probabilities of the new nodes (now Alarm and TS1Fault). For that we need additional expert knowledge from the domain. We ask our IoT team about the fault tolerance of our sensors. They tell us the following:

> Our sensors are the best, they usually do not fail. The testing guys estimate just a 1/1000 failure rate.

This is good: with this small marginal failure probability, our new node will only detect anomalies without introducing noise into the rest of the propagation within the model.

```{r}
heatAlarmModel <- parametric.em(heatAlarmDag, dataLatent, dataImputed, iter = 10)
```

```{r}
# TODO: Please don't do this, this is not DRY...
cptAlarm[] <- c(0.999, 0.001)
heatAlarmModel$Alarm <- cptAlarm
cgaussTemp <- coef(heatAlarmModel$Temp)
sdTemp     <- heatAlarmModel$Temp$sd
cgaussTemp[is.nan(cgaussTemp)] <- cgaussTemp[!is.nan(cgaussTemp)] + 10
sdTemp[is.nan(sdTemp)] <- sdTemp[!is.nan(sdTemp)]
heatAlarmModel$Temp <- list( coef = cgaussTemp, sd = sdTemp)
```


```{r}
cptTS1Fault <- coef(heatAlarmModel$TS1Fault)
print(cptTS1Fault)
```

```{r}
cptTS1Fault[] <- c(0.9999, 0.0001)
heatAlarmModel$TS1Fault <- cptTS1Fault
print(heatAlarmModel$TS1Fault)
```

And now for the TS1 node, to which we have to add a random distribution. We can use a Gaussian with a huge variance.

```{r}
cgaussTS1 <- coef(heatAlarmModel$TS1)
sdTS1     <- heatAlarmModel$TS1$sd
print(cgaussTS1)
print(sdTS1)
```

```{r}
cgaussTS1[is.nan(cgaussTS1)] <- 0
sdTS1[is.nan(sdTS1)] <- 1000000

heatAlarmModel$TS1 <- list( coef = cgaussTS1, sd = sdTS1)
heatAlarmModel$TS1
```

Let's test some inference cases in which the TS1 sensor is malfunctioning or reporting anomalous measures.

If TS1 is the only sensor capturing a very high temperature, the fault node will detect the anomaly:

```{r}
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 24, "TS3" = 20 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
```

```{r}
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 24, "TS3" = 20 )
query <- cpquery(heatAlarmModel, event = TS1Fault == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
```

If the three sensors are capturing the same anomaly, then the TS1Fault node will assume that the problem lies outside the sensor:

```{r}
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 37, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = Alarm == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
```

```{r}
e <- list( "Season" = SeasonStates$Winter, "TS1" = 35, "TS2" = 37, "TS3" = 29 )
query <- cpquery(heatAlarmModel, event = TS1Fault == "yes", evidence = e, method = "lw", n = 100000)
print(mean(rep(query, 10000)))
```



## A more complex problem

This is a variation on the previous problem. Can you spot the differences introduced by this new version?

```{r}
dataRaw <- read_csv("./data/heatAlarm-lvl2.csv")
```

It seems that the measures are somehow dependent, but the relationship is not as clear as in the previous phase.

```{r}
dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_line()

dataRaw %>%
  gather("Sensor", "Measure", TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_smooth()
```

### Looking at the context

> Our experts tell us that the indoor temperature of the room is affected by the outdoor weather, especially near the windows and doors. The temperature in the interior part of the room is more stable.

Given this new knowledge we decide to place a new sensor outdoors, `TSO`. Now we have an additional component to add to our model; hopefully it will allow us to mix the indoor and outdoor temperatures.

```{r}
dataRaw %>%
  gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
  ggplot(aes(x=Date, y=Measure, group=Sensor, color=Sensor)) +
  geom_smooth()
```

```{r}
dataByMonth <- dataRaw %>%
  mutate(Month = as.factor(month(Date, label = TRUE, abbr = TRUE)))

dataByMonth %>%
  gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.5) +
    facet_wrap(~Month)
```

```{r}
dataBySeason <- dataByMonth %>%
  mutate(Season = season(Date))

dataBySeason %>%
  gather("Sensor", "Measure", TSO, TS1, TS2, TS3) %>%
  ggplot(aes(x=Measure, color=Sensor, fill=Sensor)) + 
    geom_density(alpha=0.4) +
    facet_wrap(~Season)
```

### Adding new components to the model

To add the new sensor to the model we are going to need an additional latent variable. Our model will now have two latent temperature variables, `TInd` and `TOut` (indoor and outdoor). The indoor temperature will depend on the outdoor temperature. Let's build it:

```{r}
heatAlarmDag2 <- model2network("[Alarm][Season][TOut|Season][TSO|TOut][TInd|TOut:Alarm][TS1|TInd][TS2|TInd][TS3|TInd]")

plotD3bn(heatAlarmDag2)
```

This model is the base example for the new domain. However, we may have some doubts regarding the relationship between TOut and the sensors; look closely at the distributions and you will see that not all of them are affected equally by the outdoor temperature. We ask again...

> IoT team: Uh, yes. We installed TS1 in a corner of the room, near the entrance. TS2 is on the ceiling and might be near a ventilation shaft, so it will probably capture a bit more heat from the machines. TS3 should be more stable, it is located in the middle of the room.

This almost makes sense visually. However, at this point we will need to test the different structures, either by learning all of them and comparing with BIC, or by running a structural learning process.
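As a sketch of that comparison (the helper `parametric.em` comes from earlier in the tutorial; `dataLatent2` and `dataImputed2` are hypothetical placeholders for data prepared as before, now including `TSO`), we could fit an alternative structure where only TS1 and TS2 receive a direct effect from the outdoor temperature and contrast both candidates with BIC:

```{r}
# Candidate structure: TS1 and TS2 also depend directly on TOut,
# TS3 depends only on the indoor temperature
heatAlarmDag2b <- model2network(
  "[Alarm][Season][TOut|Season][TSO|TOut][TInd|TOut:Alarm][TS1|TInd:TOut][TS2|TInd:TOut][TS3|TInd]"
)

# Hypothetical comparison, reusing the EM helper and imputed data:
# modelA <- parametric.em(heatAlarmDag2,  dataLatent2, dataImputed2, iter = 10)
# modelB <- parametric.em(heatAlarmDag2b, dataLatent2, dataImputed2, iter = 10)
# BIC(modelA, dataImputed2)
# BIC(modelB, dataImputed2)
```

Alternatively, `bnlearn` provides `structural.em`, which runs a full structural EM over data with latent variables.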

---

---

# License

This tutorial is licensed under [GPL3 license](https://www.gnu.org/licenses/lgpl-3.0.en.html) as is originally published and maintained in this [github repository](https://github.com/jacintoArias/bayesnetRtutorial) by [Jacinto Arias](http://jarias.es)
