Proceeding Paper

Value of Information in the Binary Case and Confusion Matrix †

1 Faculty of Science and Technology, Middlesex University, London NW4 4BT, UK
2 Department of Industrial & Systems Engineering, University of Florida, P.O. Box 116595, Gainesville, FL 32611-6595, USA
3 Department of Electrical & Computer Engineering, University of Florida, P.O. Box 116130, Gainesville, FL 32611-6130, USA
* Author to whom correspondence should be addressed.
Presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris, France, 18–22 July 2022.
Published: 2 November 2022

Abstract

The simplest Bayesian system used to illustrate ideas of probability theory is a coin and a Boolean utility function. To illustrate ideas of hypothesis testing, estimation or optimal control, one needs to use at least two coins and a confusion matrix accounting for the utilities of the four possible outcomes. Here we use such a system to illustrate the main ideas of Stratonovich’s value of information (VoI) theory in the context of a financial time-series forecast. We demonstrate how VoI can provide a theoretical upper bound on the accuracy of the forecasts, facilitating the analysis and optimization of models.

1. Introduction

The concept of value of information has different definitions in the literature [1,2]. Here we follow the works of Ruslan Stratonovich and his colleagues, who were inspired by Shannon’s work on rate distortion [3] and made a number of important developments in the 1960s [2]. These mainly theoretical results are gaining new interest thanks to the advancements in data science and machine learning and the need for a deeper understanding of the role of information in learning. We shall review the value of information theory in the context of optimal estimation and hypothesis testing, although the context of optimal control is also relevant.
Consider a probability space $(\Omega, P, \mathcal{A})$ and a random variable $x : \Omega \to X$ (a measurable function). The optimal estimation of $x \in X$ is the problem of finding an element $y \in Y$ maximizing the expected value of some utility function $u : X \times Y \to \mathbb{R}$ (or minimizing it when $u$ is a cost). The optimal value is
$$U(0) := \sup_{y \in Y} \mathbb{E}_{P(x)}\{u(x, y)\},$$
where zero designates the fact that no information about the specific value of $x \in X$ is given, only the prior distribution $P(x)$. At the other extreme, let $z \in Z$ be another random variable that communicates full information about each realization of $x$. This entails that there is an invertible function $z = f(x)$ such that $x = f^{-1}(z)$ is determined uniquely by the ‘message’ $z \in Z$. The corresponding optimal value is
$$U(\infty) := \mathbb{E}_{P(x)}\Big\{\sup_{y(z)} u\big(x, y(z)\big)\Big\},$$
where an optimal $y$ is found for each $z$ (i.e., the optimization is over all mappings $y : Z \to Y$). In the context of estimation, the variable $x$ is the response (i.e., the variable of interest) and $z$ is the predictor. The mapping $y(z)$ represents a model with output $y \in Y$.
Let $I \in [0, \infty]$ be an intermediate amount of information, and let $U(I) \in [U(0), U(\infty)]$ be the corresponding optimal value. The value of information is the difference [4]:
$$V(I) := U(I) - U(0).$$
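As a toy illustration of the two extremes, the short sketch below computes $U(0)$, $U(\infty)$ and $V(\infty) = U(\infty) - U(0)$ for a discrete system; the prior and utility values are purely illustrative and are not taken from this paper.

```python
import numpy as np

# Toy 2x2 example (hypothetical numbers): prior P(x) and utility u(x, y).
P = np.array([0.7, 0.3])          # P(x1), P(x2)
u = np.array([[1.0, 0.0],         # u(x_i, y_j)
              [0.0, 1.0]])

# U(0): a single decision y, optimal against the prior alone.
U0 = max(P @ u[:, j] for j in range(u.shape[1]))

# U(inf): the optimal y is chosen separately for each realization of x.
Uinf = P @ u.max(axis=1)

print(U0, Uinf, Uinf - U0)        # the last number is V(inf) = U(inf) - U(0)
```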
There are, however, different ways in which the information amount $I$ and the quantity $U(I)$ can be defined, leading to different types of the value function $V(I)$. For example, consider a mapping $f : X \to Z$ with a constraint $|Z| \leq e^{I} < |X|$ on the cardinality of its image. The mapping $f$ partitions its domain into a finite number of subsets $f^{-1}(z) = \{x \in X : f(x) = z\}$. Then, given a specific partition $z(x)$, one can find an optimal $y(z)$ maximizing the conditional expected utility $\mathbb{E}_{P(x \mid z)}\{u(x, y) \mid z\}$ for each subset $f^{-1}(z) \ni x$. This optimization should be repeated for different partitions $z(x)$, and the optimal value $U(I)$ is defined over all partitions $z(x)$ satisfying the cardinality constraint $\ln|Z| \leq I$:
$$U(I) := \sup_{z(x)} \Big\{ \mathbb{E}_{P(z)} \sup_{y(z)} \mathbb{E}_{P(x \mid z)}\{u(x, y) \mid z\} \;:\; \ln|Z| \leq I \Big\} \qquad (1)$$
Here, $P(z) = P\{x \in f^{-1}(z)\}$. The quantity $I = \ln|Z|$ is called Hartley’s information, and the difference $V(I) = U(I) - U(0)$ in this case is the value of Hartley’s information. One can relax the cardinality constraint and replace it with a constraint on the entropy, $H(Z) \leq I$, where $H(Z) = \mathbb{E}_{P(z)}\{-\ln P(z)\} \leq \ln|Z|$. In this case, $V(I)$ is called the value of Boltzmann’s information [4].
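For a very small $X$, the value of Hartley’s information in (1) can be computed by brute force, enumerating all mappings $z(x)$ with $|Z| \leq e^{I}$. The sketch below does this for a hypothetical example (illustrative numbers; the enumeration cost is exponential in $|X|$, which motivates the discussion that follows).

```python
import itertools
import numpy as np

# Brute-force sketch of the value of Hartley's information for a tiny X
# (hypothetical numbers; feasible only because |X| and |Z| are tiny).
P = np.array([0.4, 0.3, 0.2, 0.1])                   # prior P(x) on X = {x1,...,x4}
u = np.random.default_rng(0).uniform(size=(4, 3))    # utility u(x, y) with |Y| = 3
Zmax = 2                                             # cardinality constraint |Z| <= e^I

def U0(P, u):
    # optimal value without information: one y chosen against the prior
    return max(P @ u[:, j] for j in range(u.shape[1]))

def U_hartley(P, u, Zmax):
    best = -np.inf
    # every mapping z(x): X -> {0, ..., Zmax - 1} defines a partition of X
    for labels in itertools.product(range(Zmax), repeat=len(P)):
        labels = np.array(labels)
        total = 0.0
        for z in range(Zmax):
            cell = labels == z
            if cell.any():
                # optimal y for the cell f^{-1}(z); the cell's probability is absorbed
                # because sum_{x in cell} P(x) u(x, y) = P(z) E{u(x, y) | z}
                total += max(P[cell] @ u[cell, j] for j in range(u.shape[1]))
        best = max(best, total)
    return best

print(U_hartley(P, u, Zmax) - U0(P, u))              # value of Hartley's information V(ln 2)
```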
One can see from Equation (1) that the computation of the value of Hartley’s or Boltzmann’s information is quite demanding and may involve a procedure such as the k-means clustering algorithm or training a multilayer neural network. Thus, using these values of information is not practical due to high computational costs. The main result of Stratonovich’s theory [4] is that the upper bound on Hartley’s or Boltzmann’s values of information is given by the value of Shannon’s information, and that asymptotically all these values are equivalent (Theorems 11.1 and 11.2 in [4]). The value of Shannon’s information is much easier to compute.
Recall the definition of Shannon’s mutual information [3]:
$$I(X, Y) := \mathbb{E}_{W(x, y)}\Big\{\ln \frac{P(x \mid y)}{P(x)}\Big\} = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X),$$
where $W(x, y) = P(x \mid y)\, Q(y)$ is the joint probability distribution on $X \times Y$, and $H(X \mid Y)$ is the conditional entropy. Under broad assumptions on the reference measures (see Theorem 1.16 in [4]), the following inequalities are valid:
$$0 \leq I(X, Y) \leq \min\{H(X), H(Y)\} \leq \min\{\ln|X|, \ln|Y|\}.$$
The value of Shannon’s information is defined using the quantity:
$$U(I) := \sup_{P(y \mid x)} \big[\, \mathbb{E}_{W}\{u(x, y)\} \;:\; I(X, Y) \leq I \,\big] \qquad (2)$$
The optimization above is over all conditional probabilities $P(y \mid x)$ (or joint measures $W(x, y) = P(y \mid x)\, P(x)$) satisfying the information constraint $I(X, Y) \leq I$. Contrast this with $U(I)$ for Hartley’s or Boltzmann’s information (1), where the optimization is over the mappings $y(x) = y \circ z(x)$. As was pointed out in [5], the relation between functions (1) and (2) is similar to that between optimal transport problems in the Monge and Kantorovich formulations. Joint distributions optimal in the sense of (2) are found using the standard method of Lagrange multipliers (e.g., see [4,6]):
$$W(x, y; \beta) = P(x)\, Q(y)\, e^{\beta u(x, y) - \gamma(\beta, x)}, \qquad (3)$$
where the parameter $\beta^{-1}$, called temperature, is the Lagrange multiplier associated with the constraint $I(X, Y) \leq I$. Distributions $P$ and $Q$ are the marginals of $W$, and the function $\gamma(\beta, x)$ is defined by the normalization $\sum_{x, y} W(x, y; \beta) = 1$. In fact, taking partial traces of solution (3) gives two equations:
$$\sum_x W(x, y) = Q(y) \quad\Longleftrightarrow\quad \sum_x e^{\beta u(x, y) - \gamma(\beta, x)}\, P(x) = 1, \qquad (4)$$
$$\sum_y W(x, y) = P(x) \quad\Longleftrightarrow\quad \sum_y e^{\beta u(x, y)}\, Q(y) = e^{\gamma(\beta, x)}. \qquad (5)$$
Equation (5) defines the function $\gamma(\beta, x) = \ln \sum_y e^{\beta u(x, y)} Q(y)$. If the linear transformation $T(\cdot) = \sum_x e^{\beta u(x, y)}(\cdot)$ has an inverse, then from Equation (4) one obtains $e^{-\gamma(\beta, x)} P(x) = T^{-1}(1)$ or
$$\gamma(\beta, x) = -\ln \sum_y b(x, y) + \ln P(x) = \gamma_0(\beta, x) - h(x),$$
where $\gamma_0(\beta, x) := -\ln \sum_y b(x, y)$, $b(x, y)$ is the kernel of the inverse transformation $T^{-1}$, and $h(x) = -\ln P(x)$ is random entropy or surprise. Integrating the above with respect to the measure $P(x)$ we obtain
$$\Gamma(\beta) := \sum_x \gamma(\beta, x)\, P(x) = \Gamma_0(\beta) - H(X),$$
where $\Gamma_0(\beta) := \sum_x \gamma_0(\beta, x)\, P(x)$. The function $\Gamma(\beta)$ is the cumulant generating function of the optimal distribution (3). Indeed, the expected utility and Shannon’s information for this distribution are
$$U(\beta) = \Gamma'(\beta) = \Gamma_0'(\beta), \qquad I(\beta) = \beta\, \Gamma'(\beta) - \Gamma(\beta) = H(X) - \big[\Gamma_0(\beta) - \beta\, \Gamma_0'(\beta)\big].$$
The first formula can be obtained directly by differentiating $\Gamma(\beta)$, and the second by substituting (3) into the formula for Shannon’s mutual information. The function $\Gamma_0(\beta) - \beta\, \Gamma_0'(\beta)$ is clearly the conditional entropy $H(X \mid Y)$, because $I(X, Y) = H(X) - H(X \mid Y)$.
Note that information is the Legendre–Fenchel transform $I(U) = \sup_\beta\{\beta U - \Gamma(\beta)\}$ of the convex function $\Gamma(\beta)$ (indeed, $U = \Gamma'(\beta)$). The inverse of $I(U)$ is the optimal value $U(I)$ from Equation (2) defining the value of Shannon’s information, and it is the Legendre–Fenchel transform $U(I) = \inf\{\beta^{-1} I - F(\beta^{-1})\}$ of the concave function $F(\beta^{-1}) = -\beta^{-1}\Gamma(\beta)$, which is called free energy.
The general strategy for computing the value of Shannon’s information is to derive the expressions for $U(\beta)$ and $I(\beta)$ from the function $\Gamma_0(\beta)$ (alternatively, one can obtain $U(\beta^{-1})$ and $I(\beta^{-1})$ from the free energy $F_0(\beta^{-1}) = -\beta^{-1}\Gamma_0(\beta)$). The dependency $U(I)$ is then obtained either parametrically or by excluding $\beta$. Let us now apply this to the simplest $2 \times 2$ case.
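The paper proceeds analytically, but the parametric curve $(I(\beta), U(\beta))$ can also be traced numerically for any discrete system with a Blahut–Arimoto-style fixed-point iteration on the optimal joint distribution (3). The sketch below is such a cross-check with illustrative numbers; it is not part of the original derivation.

```python
import numpy as np

# Numerical cross-check of the parametric curve (I(beta), U(beta)) using a
# Blahut-Arimoto-style fixed point on the optimal joint distribution (3).
# The prior and utility below are illustrative, not from the paper.
def shannon_voi_curve(P, u, betas, n_iter=500):
    points = []
    for beta in betas:
        Q = np.full(u.shape[1], 1.0 / u.shape[1])            # initial output marginal
        for _ in range(n_iter):
            K = Q * np.exp(beta * u)                          # proportional to P(y | x)
            Pyx = K / K.sum(axis=1, keepdims=True)
            Q = P @ Pyx                                       # updated output marginal
        W = Pyx * P[:, None]                                  # joint W(x, y)
        U = float((W * u).sum())                              # expected utility
        I = float((W * np.log(W / (P[:, None] * Q[None, :]))).sum())  # mutual information
        points.append((I, U))
    return points

P = np.array([0.7, 0.3])
u = np.array([[1.0, 0.0],
              [0.0, 1.0]])
for I, U in shannon_voi_curve(P, u, betas=[0.5, 1.0, 2.0, 5.0]):
    print(f"I = {I:.4f} nats, U(I) = {U:.4f}")
```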

2. Value of Shannon’s Information for the 2 × 2 System

Let $X \times Y = \{x_1, x_2\} \times \{y_1, y_2\}$, and let $u : X \times Y \to \mathbb{R}$ be the utility function, which we can represent by a $2 \times 2$ matrix:
$$u(x, y) = \begin{pmatrix} u(x_1, y_1) & u(x_1, y_2) \\ u(x_2, y_1) & u(x_2, y_2) \end{pmatrix} = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix} = \begin{pmatrix} c_1 + d_1 & c_1 - d_1 \\ c_2 - d_2 & c_2 + d_2 \end{pmatrix}$$
It is called the confusion matrix in the context of hypothesis testing, where rows correspond to the true states $\{x_1, x_2\}$, and columns correspond to accepting or rejecting the hypothesis $\{y_1, y_2\}$. The set of all joint distributions $W(x, y)$ is a 3-simplex (a tetrahedron), shown in Figure 1. The 2D surface in the middle is the set of all product distributions $W(x, y; 0) = P(x)\, Q(y)$, which correspond to the minimum $I(X, Y) = 0$ of mutual information (independent $x$, $y$). With no additional information about $x$, the decision $y_1$ to accept or $y_2$ to reject the hypothesis is completely determined by the utilities and the prior probabilities $P(x_1) = p$ and $P(x_2) = 1 - p$. Thus, one has to compare the expected utilities $\mathbb{E}_P\{u \mid y_1\} = p\, u_{11} + (1 - p)\, u_{21}$ and $\mathbb{E}_P\{u \mid y_2\} = p\, u_{12} + (1 - p)\, u_{22}$. The output distribution $Q(y)$ is an elementary $\delta$-distribution:
$$Q(y_1) = \begin{cases} 1 & \text{if } \dfrac{p}{1 - p} \geq \dfrac{u_{22} - u_{21}}{u_{11} - u_{12}} = \dfrac{d_2}{d_1}, \\ 0 & \text{otherwise.} \end{cases}$$
The optimal value corresponding to $I = 0$ information is $U(0) = p\, c_1 + (1 - p)\, c_2 + |p\, d_1 - (1 - p)\, d_2|$. In the case when $c_1 = c_2 = c$ and $d_1 = d_2 = d$, the condition for $y_1$ is $d(2p - 1) \geq 0$, and $U(0) = c + d\,|2p - 1|$. With $c = 1/2$ and $d = 1/2$, the value $U(0) = \frac{1}{2} + \frac{1}{2}|2p - 1|$ represents the best possible accuracy for prior probabilities $P(x) \in \{p, 1 - p\}$. If additional information about $x$ is communicated, say by some random variable $z \in Z$, then the maximum possible improvement $V(I) = U(I) - U(0)$ is the value of this information. The first step in deriving the function $U(I)$ for the value of Shannon’s information (2) is to obtain the expression for the function $\Gamma(\beta) = \Gamma_0(\beta) - H(X)$.
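Before the derivation, a quick numerical check of the zero-information baseline $U(0)$ (the parameter values here are only illustrative):

```python
# U(0) = p*c1 + (1-p)*c2 + |p*d1 - (1-p)*d2| for illustrative parameter values.
def U0(p, c1, c2, d1, d2):
    return p * c1 + (1 - p) * c2 + abs(p * d1 - (1 - p) * d2)

# With c = d = 1/2 this is the baseline accuracy 1/2 + |2p - 1|/2.
for p in [0.5, 0.6, 0.9]:
    print(p, U0(p, 0.5, 0.5, 0.5, 0.5))
```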
Writing Equation (4) in the matrix form $e^{\beta u^{T}}\big[P(x)\, e^{-\gamma(\beta, x)}\big] = \mathbf{1}$ and using the inverse matrix $\big(e^{\beta u^{T}}\big)^{-1}$ gives the solution for the function $e^{-\gamma_0(\beta, x)} = P(x)\, e^{-\gamma(\beta, x)}$:
$$\begin{pmatrix} p\, e^{-\gamma(\beta, x_1)} \\ (1 - p)\, e^{-\gamma(\beta, x_2)} \end{pmatrix} = \begin{pmatrix} e^{\beta u_{11}} & e^{\beta u_{21}} \\ e^{\beta u_{12}} & e^{\beta u_{22}} \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{1}{\det e^{\beta u^{T}}} \begin{pmatrix} e^{\beta u_{22}} & -e^{\beta u_{21}} \\ -e^{\beta u_{12}} & e^{\beta u_{11}} \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix},$$
where $\det e^{\beta u^{T}} = e^{\beta(u_{11} + u_{22})} - e^{\beta(u_{12} + u_{21})} = 2\, e^{\beta(c_1 + c_2)} \sinh[\beta(d_1 + d_2)]$. This gives two equations:
$$p\, e^{-\gamma(\beta, x_1)} = \frac{e^{\beta u_{22}} - e^{\beta u_{21}}}{e^{\beta(u_{11} + u_{22})} - e^{\beta(u_{12} + u_{21})}} = e^{-\beta c_1}\, \frac{\sinh(\beta d_2)}{\sinh[\beta(d_1 + d_2)]} =: e^{-\gamma_0(\beta, x_1)}, \qquad (1 - p)\, e^{-\gamma(\beta, x_2)} = \frac{e^{\beta u_{11}} - e^{\beta u_{12}}}{e^{\beta(u_{11} + u_{22})} - e^{\beta(u_{12} + u_{21})}} = e^{-\beta c_2}\, \frac{\sinh(\beta d_1)}{\sinh[\beta(d_1 + d_2)]} =: e^{-\gamma_0(\beta, x_2)}.$$
Therefore, the expression for the function $\Gamma_0(\beta) := p\, \gamma_0(\beta, x_1) + (1 - p)\, \gamma_0(\beta, x_2)$ is
$$\Gamma_0(\beta) = \beta\big[p\, c_1 + (1 - p)\, c_2\big] + \ln\big|\sinh[\beta(d_1 + d_2)]\big| - p \ln\big|\sinh(\beta d_2)\big| - (1 - p) \ln\big|\sinh(\beta d_1)\big|.$$
Its first derivative $\Gamma_0'(\beta)$ gives the expression for $U(\beta)$:
$$U(\beta) = p\, c_1 + (1 - p)\, c_2 + (d_1 + d_2) \coth[\beta(d_1 + d_2)] - p\, d_2 \coth(\beta d_2) - (1 - p)\, d_1 \coth(\beta d_1).$$
The expression for information is obtained from $I(\beta) = H(X) - [\Gamma_0(\beta) - \beta\, \Gamma_0'(\beta)]$, where $H(X) = -p \ln p - (1 - p) \ln(1 - p)$. The two functions $U(\beta)$ and $I(\beta)$ define the parametric dependency $U(I)$ for the value of Shannon’s information (2).
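The closed-form expressions for $\Gamma_0(\beta)$, $U(\beta)$ and $I(\beta)$ above translate directly into code. The following sketch evaluates the parametric curve for illustrative parameter values (using $\coth x = 1/\tanh x$); it is only a numerical restatement of the formulas, not additional theory.

```python
import numpy as np

# Direct implementation of Gamma_0(beta), U(beta) and I(beta) for general
# c1, c2, d1, d2 (parameter values below are only illustrative).
def H2(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def Gamma0(beta, p, c1, c2, d1, d2):
    return (beta * (p * c1 + (1 - p) * c2)
            + np.log(np.abs(np.sinh(beta * (d1 + d2))))
            - p * np.log(np.abs(np.sinh(beta * d2)))
            - (1 - p) * np.log(np.abs(np.sinh(beta * d1))))

def U(beta, p, c1, c2, d1, d2):
    coth = lambda x: 1.0 / np.tanh(x)
    return (p * c1 + (1 - p) * c2
            + (d1 + d2) * coth(beta * (d1 + d2))
            - p * d2 * coth(beta * d2)
            - (1 - p) * d1 * coth(beta * d1))

def I(beta, p, c1, c2, d1, d2):
    # I(beta) = H(X) - [Gamma_0(beta) - beta * Gamma_0'(beta)], with Gamma_0' = U
    return H2(p) - (Gamma0(beta, p, c1, c2, d1, d2)
                    - beta * U(beta, p, c1, c2, d1, d2))

# Parametric curve (I(beta), U(beta)); for p = 1/2, c = d = 1/2 it should
# reproduce U = 1/2 + tanh(beta/2)/2.
for beta in [0.5, 1.0, 2.0, 4.0]:
    print(beta, I(beta, 0.5, 0.5, 0.5, 0.5, 0.5), U(beta, 0.5, 0.5, 0.5, 0.5, 0.5))
```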
Notice that the function $\Gamma_0(\beta)$ (and hence $U(\beta)$ and $I(\beta)$) depends in general on $P(x) \in \{p, 1 - p\}$. If, however, $c_1 = c_2 = c$ and $d_1 = d_2 = d$, then, using the formula $\sinh(2x)/\sinh(x) = 2\cosh(x)$, we obtain the simplified expressions $\Gamma_0(\beta) = \beta c + \ln[2\cosh(\beta d)]$ and
$$U(\beta) = c + d \tanh(\beta d), \qquad I(\beta) = H(X) - \big[\ln[2\cosh(\beta d)] - \beta d \tanh(\beta d)\big].$$
Let us denote $\theta := \frac{U - c}{d} = \tanh(\beta d) \in [0, 1]$. Then the expression for information is
$$I(\theta) = H(X) - \ln\big[2\cosh(\tanh^{-1}\theta)\big] + \theta \tanh^{-1}\theta = H(X) + \ln\tfrac{1}{2} + \tfrac{1}{2}\ln(1 - \theta^2) + \tfrac{1}{2}\theta \ln\frac{1 + \theta}{1 - \theta} = H_2[p] - H_2\Big[\frac{1 + \theta}{2}\Big].$$
In the first step we used the formulae $\cosh(\tanh^{-1}\theta) = 1/\sqrt{1 - \theta^2}$ and $\tanh^{-1}\theta = \frac{1}{2}\ln\frac{1 + \theta}{1 - \theta}$. The last equation is written using binary entropies $H_2[p] = -p\ln p - (1 - p)\ln(1 - p)$, which shows that an increase of information in a binary system is directly related to an increase of the probability $(1 + \theta)/2 \geq \max\{p, 1 - p\}$ due to conditioning on the ‘message’ $z \in Z$ about the realization of $x \in X$. Additionally, substituting $\theta = (U - c)/d$ we obtain the closed-form expression:
$$I(U) = H_2[p] - H_2\Big[\frac{1}{2} + \frac{1}{2}\cdot\frac{U - c}{d}\Big] \qquad (6)$$
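Equation (6) can be inverted numerically to obtain the bound $U(I)$, i.e., the $\mathrm{Accuracy}(I)$ curve used in Section 3. A minimal sketch, assuming $c = d = 1/2$, $0 < I < H_2[p]$ and SciPy’s root finder:

```python
import numpy as np
from scipy.optimize import brentq

# Numerical inverse of Equation (6): the largest U satisfying
# H2[p] - H2[1/2 + (U - c)/(2d)] = I, assuming c = d = 1/2 and 0 < I < H2[p].
def H2(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log(q) - (1 - q) * np.log(1 - q)

def accuracy_bound(I, p=0.5, c=0.5, d=0.5):
    f = lambda U: H2(p) - H2(0.5 + 0.5 * (U - c) / d) - I
    # the bracket [c, c + d) contains the root whenever 0 < I < H2[p]
    return brentq(f, c, c + d - 1e-9)

for I in [0.01, 0.05, 0.1, 0.2, 0.5]:
    print(I, accuracy_bound(I))
```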
Let us derive the equations for the output probabilities $Q(y) = \sum_x P(y \mid x)\, P(x)$. This can be done using Equation (5), which in matrix form is $e^{\beta u(x, y)}\, Q(y) = e^{\gamma(\beta, x)}$. Thus, we obtain
$$\begin{pmatrix} q \\ 1 - q \end{pmatrix} = \begin{pmatrix} e^{\beta u_{11}} & e^{\beta u_{12}} \\ e^{\beta u_{21}} & e^{\beta u_{22}} \end{pmatrix}^{-1} \begin{pmatrix} e^{\gamma(\beta, x_1)} \\ e^{\gamma(\beta, x_2)} \end{pmatrix} = \frac{1}{\det e^{\beta u}} \begin{pmatrix} e^{\beta u_{22}} & -e^{\beta u_{12}} \\ -e^{\beta u_{21}} & e^{\beta u_{11}} \end{pmatrix} \begin{pmatrix} e^{\gamma(\beta, x_1)} \\ e^{\gamma(\beta, x_2)} \end{pmatrix},$$
where $\det e^{\beta u} = e^{\beta(u_{11} + u_{22})} - e^{\beta(u_{12} + u_{21})} = 2\, e^{\beta(c_1 + c_2)} \sinh[\beta(d_1 + d_2)]$. This gives two equations:
$$Q(y_1) = \frac{p}{1 - e^{-2\beta d_2}} + \frac{1 - p}{1 - e^{2\beta d_1}}, \qquad Q(y_2) = \frac{1 - p}{1 - e^{-2\beta d_1}} + \frac{p}{1 - e^{2\beta d_2}}.$$
It is easy to check that $Q(y_1) + Q(y_2) = 1$. Additionally, if $p = 1 - p$, then $Q(y_1) \geq 0$ and $Q(y_2) \geq 0$ for all $\beta \geq 0$. However, when $p \neq 1 - p$, there exists $\beta_0 > 0$ such that either $Q(y_1) < 0$ or $Q(y_2) < 0$ for $\beta \in [0, \beta_0)$. The value $\beta_0$ can be found from $Q(y_1) = 0$ or $Q(y_2) = 0$. For $d_1 = d_2 = d$ this value is
$$\beta_0 = \frac{1}{2d}\Big|\ln\frac{p}{1 - p}\Big|.$$
One can show that $I(\beta_0) = 0$ and $U(\beta_0) = c + d\,|2p - 1|$. Thus, the output probabilities are non-negative for all $\beta \geq \beta_0$, which corresponds to positive information $I \geq 0$ and $U(I) \geq U(0)$.
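A small numerical check of the output probabilities and the threshold $\beta_0$ for the symmetric case $d_1 = d_2 = d$ (the values of $p$ and $d$ below are illustrative):

```python
import numpy as np

# Output probability Q(y1) and the threshold beta_0 for d1 = d2 = d.
p, d = 0.7, 0.5

def Q1(beta):
    return p / (1 - np.exp(-2 * beta * d)) + (1 - p) / (1 - np.exp(2 * beta * d))

beta0 = abs(np.log(p / (1 - p))) / (2 * d)
print(Q1(beta0))      # equals 1 at beta_0, so Q(y2) = 1 - Q(y1) hits zero
print(Q1(5.0))        # for large beta, Q(y1) approaches p (cf. the limit noted below)
```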
It is important to note that in the limit $\beta \to \infty$, corresponding to an increase of information to its maximum, the output probabilities $Q(y) \in \{q, 1 - q\}$ converge to $P(x) \in \{p, 1 - p\}$.

3. Application: Accuracy of Time-Series Forecasts

In this section, we illustrate how the value of information can facilitate the analysis of the performance of data-driven models. Here we use financial time-series data and predict the signs of future log returns. Thus, if $s(t)$ and $s(t - 1)$ are the prices of an asset at two consecutive time moments, then $r(t) = \ln[s(t)/s(t - 1)]$ is the log return at $t$. The models try to predict whether the future log return $r(t + 1)$ is positive or negative. Thus, we have a $2 \times 2$ system, where $x \in \{x_1, x_2\}$ is the true sign, and $y \in \{y_1, y_2\}$ is the prediction. The accuracy of different models is evaluated against the theoretical upper bound defined by the value of information.
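For concreteness, log returns and the binary response can be computed from a price series as follows (the prices below are synthetic placeholders; the paper uses daily close prices of cryptocurrency pairs):

```python
import numpy as np

# Log returns and the binary response from a price series.
s = np.array([100.0, 102.0, 101.0, 105.0, 104.0])   # close prices s(t)
r = np.diff(np.log(s))                               # r(t) = ln[s(t) / s(t-1)]
x = np.sign(r)                                       # response: sign of the log return
print(r, x)
```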
The data used here are daily close prices $s(t)$ of several cryptocurrency pairs between 1 January 2019 and 11 January 2021. Figure 2 shows the price of Bitcoin against USD (left) and the corresponding log returns (right). Predicting price changes is very challenging. In fact, in economics, log returns are often assumed to be independent (and hence prices $s(t)$ are assumed to be Markov). Indeed, one can see no obvious relation in the left chart of Figure 3, which plots log returns $r(t)$ (abscissa) against $r(t + 1)$ (ordinate). In reality, however, some amounts of information and correlation exist, which can be seen from the plot of the autocorrelation function for BTC/USD shown in the right chart of Figure 3.
The idea of autoregressive models is to use the small amounts of information between past and future values for forecasts. In addition to autocorrelations (correlations between the values of $\{r(t)\}$ at different times), information can be increased by using cross-correlations (correlations between log returns of different symbols in the dataset). Thus, the vector of predictors used here is an $m \times n$-tuple, where $m$ is the number of symbols used, and $n$ is the number of time lags. In this paper, we report the results of models using the range $m \in \{1, 2, \ldots, 5\}$ of symbols (BTC/USD, ETH/USD, DAI/BTC, XRP/BTC, IOT/BTC) and $n \in \{2, 3, \ldots, 20\}$ of lags. This means that the models used predictors $(z_1, \ldots, z_{m \times n})$, where $m \times n$ ranged from 2 to 100. The model output $y(z)$ is the forecast of the sign $x \in \{-1, +1\}$ (the response) of the future log return $r(t + 1)$ of BTC/USD. Here we report results from the following models:
  • Logistic regression (LM). This model has no hyperparameters.
  • Partial least squares discrimination (PLSD). We used the SIMPLS algorithm [7] with three components.
  • Feed-forward neural network (NN). Here we used one hidden layer with three logistic units.
In order to analyse the performance of the models using the value of information, one has to estimate the amount of information between the predictors $z_1, \ldots, z_{m \times n}$ and the response variable $x$. Here we employ two methods. The first uses the following Gaussian formula [4]:
$$I(X, Z) \approx \frac{1}{2}\big[\ln\det K_z + \ln\det K_x - \ln\det K_{zx}\big],$$
where $K_z$ and $K_x$ are the covariance matrices of the predictors and the response, and $K_{zx}$ is the covariance matrix of their joint vector. Because the distributions of log returns are generally not Gaussian, this formula is an approximation (in fact, it gives a lower bound). The second method is based on the discretization of continuous variables. Because the models were used to predict signs of log returns, here we used discretization into two subsets. Figure 4 shows the average amounts of information $I(X, Z)$ in the training sets, computed using the Gaussian formula (left) and using binary discretization (right). Information (ordinate) is plotted against the number $n$ of lags (abscissa) for $m \in \{1, 2, \ldots, 5\}$ symbols (different curves). One can see that the amounts of information estimated using the Gaussian approximation (left) are generally lower than those estimated using discretization (right). We note, however, that linear models can only use linear dependencies (correlations), which means that the Gaussian approximation is sufficient for assessing the performance of linear models, such as LM and PLSD. Non-linear models, on the other hand, can potentially use all the information present in the data. Therefore, we used information estimated with the second method to assess the performance of NN.
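A sketch of the Gaussian estimate of $I(X, Z)$ from sample covariances is given below; the data are synthetic stand-ins for the lagged log returns (predictors $Z$) and the response $X$, and the variable names are illustrative.

```python
import numpy as np

# Gaussian estimate of I(X, Z) from sample covariances (synthetic data).
rng = np.random.default_rng(1)
n, dim_z = 500, 6
Z = rng.standard_normal((n, dim_z))
X = 0.4 * Z[:, [0]] + rng.standard_normal((n, 1))    # response correlated with Z

K_z = np.cov(Z, rowvar=False)
K_x = np.atleast_2d(np.cov(X.ravel()))
K_zx = np.cov(np.hstack([Z, X]), rowvar=False)

I_gauss = 0.5 * (np.linalg.slogdet(K_z)[1]
                 + np.linalg.slogdet(K_x)[1]
                 - np.linalg.slogdet(K_zx)[1])
print(I_gauss)    # in nats; a lower bound when the data are not Gaussian
```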
For each collection of predictors $(z_1, \ldots, z_{m \times n})$ and response $x$, the data were split into multiple training and testing subsets using the following rolling-window procedure: we used 200- and 50-day data windows for training and testing, respectively; after training and testing the models, the windows were moved forward by 50 days and the process was repeated. Thus, the data of approximately 700 days (January 2019 to January 2021) were split into $(700 - 200)/50 = 10$ pairs of training and testing sets. The results reported here are the averages over these 10 subsets.
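The rolling-window split described above can be expressed as a short index-based helper (a sketch; the window sizes match the text):

```python
# Rolling-window split: 200-day training window, 50-day testing window,
# both advanced by 50 days.
def rolling_splits(n_days, train=200, test=50, step=50):
    splits, start = [], 0
    while start + train + test <= n_days:
        splits.append((range(start, start + train),
                       range(start + train, start + train + test)))
        start += step
    return splits

print(len(rolling_splits(700)))   # 10 train/test pairs for ~700 days of data
```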
Figure 5 shows the accuracies of the models plotted against the information amounts $I$ in the training data. The top row shows results on the training sets (i.e., fitted values) and the bottom row on new data (i.e., predicted values). Different curves are plotted for different numbers of symbols $m \in \{1, \ldots, 5\}$. The theoretical upper bounds are shown by the $\mathrm{Accuracy}(I)$ curves computed using the inverse of function (6) with $c = d = 1/2$ and $p = 1/2$. Here we note the following observations:
  • The accuracy of fitting the training data closely follows theoretical curve Accuracy ( I ) . The accuracy of predicting new data (testing sets) is significantly lower.
  • Increasing information increases the accuracy on training data, but not necessarily on new data.
  • Models using m > 1 symbols appear to achieve better accuracy than models using m = 1 symbol with the same amounts of information. Thus, surprisingly, cross-correlations potentially provide more valuable information for forecasts than autocorrelations.

4. Discussion

We have reviewed the main ideas of Stratonovich’s value of information theory [2,4] and applied it to the simplest $2 \times 2$ Bayesian system. We explicitly performed the main computations for the cumulant generating function $\Gamma(\beta) = \Gamma_0(\beta) - H(X)$ and derived the functions $U(\beta)$ and $I(\beta)$ defining the dependency $U(I)$ and the value of Shannon’s information $V(I) = U(I) - U(0)$. The main application of the considered binary example is the evaluation of the accuracy of model predictions or hypothesis testing. The analysis of the performance of data-driven models can be enriched by the use of the value of information. However, one needs to be careful about the estimation of the amount of information in the data. A Gaussian approximation of mutual information can be used for linear models, but other techniques should be used for the analysis of non-linear models, such as neural networks. Here we applied the value of information to the analysis of financial time-series forecasts. These methods can be generalized to many other machine learning and data science problems.

Author Contributions

Conceptualization and formal analysis, R.B.; methodology, R.B., P.P. and J.P.; software, R.B.; modelling and experiments, R.B.; data curation, R.B., P.P. and J.P.; writing—original draft preparation, R.B.; writing—review and editing, R.B., P.P. and J.P.; funding acquisition, J.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the ONR grant number N00014-21-1-2295.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Close day prices used in this work are available at https://0-doi-org.brum.beds.ac.uk/10.22023/mdx.21436248.

Acknowledgments

The authors deeply acknowledge Stefan Behringer for additional discussion of the example and Roman Tarabrin for providing a MacBook Pro laptop used for the computational experiments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Howard, R.A. Information Value Theory. IEEE Trans. Syst. Sci. Cybern. 1966, 2, 22–26.
  2. Stratonovich, R.L. On value of information. Izv. USSR Acad. Sci. Tech. Cybern. 1965, 5, 3–12. (In Russian)
  3. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423 and 623–656.
  4. Stratonovich, R.L. Theory of Information and Its Value; Springer: Cham, Switzerland, 2020.
  5. Belavkin, R.V. Relation Between the Kantorovich–Wasserstein Metric and the Kullback–Leibler Divergence. In Proceedings of the Information Geometry and Its Applications, Liblice, Czech Republic, 12–17 June 2016; Ay, N., Gibilisco, P., Matúš, F., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 363–373.
  6. Belavkin, R.V. Optimal measures and Markov transition kernels. J. Glob. Optim. 2013, 55, 387–416.
  7. de Jong, S. SIMPLS: An alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 1993, 18, 251–263.
Figure 1. A 3-simplex of all joint distributions on a 2 × 2 system.
Figure 2. Close day prices of BTC/USD (left) and the corresponding log returns (right).
Figure 3. Log returns of BTC/USD on two consecutive days (A); the autocorrelation function (B).
Figure 4. The average amounts of mutual information between predictors and response in the training sets, computed using Gaussian approximation (left) and using binary discretization (right). The abscissa shows the numbers n of lags; different curves correspond to numbers m of symbols used.
Figure 5. Accuracy of fitted values on training data (top row) and of predicted values on testing data (bottom row) for three types of models plotted as functions of information in the training data. Theoretical curves are plotted using the inverse of function (6) for c = d = 1/2 and p = 1/2. Different curves correspond to the number m of symbols used.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
