Probability Concepts for Machine Learning

Dilip Kumar
Jul 9, 2023

Regular variable

A regular variable's value doesn't change once it is assigned. Here are a few examples.

x = 3
z = 4
age = 44

Random variable

A random variable, X, is a quantity that can take a different value each time it is inspected, such as a measurement in an experiment. Here are a few examples.

X = value after rolling a die

Here, each inspection can yield any value from the finite set {1, 2, 3, 4, 5, 6}.

X = light sensor reading

Here, due to temperature fluctuations and internal circuit noise, the light reading differs on every inspection. It could be any of infinitely many values in a range, e.g. 39.xx.
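Both kinds of random variable can be simulated in Python; the sensor's Gaussian noise model below is an assumption purely for illustration:

import random

# Discrete random variable: each inspection draws from {1, ..., 6}
dice_roll = random.randint(1, 6)

# Continuous random variable: a hypothetical light sensor whose reading
# fluctuates around 39.0 due to assumed Gaussian circuit noise
sensor_reading = random.gauss(39.0, 0.05)

print(dice_roll, sensor_reading)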

Types of random variables

There are two types of random variables:

  1. Discrete random variable
  2. Continuous random variable

Discrete random variable

A discrete random variable is one which may take on only a countable number of distinct values.

For example, rolling a die can yield values from the set {1, 2, 3, 4, 5, 6}.

Continuous random variable

A continuous random variable is one which can take any of an infinite number of possible values, typically within a continuous range.

For example, a light-sensor measurement can take any value in a continuous range if the measuring device has high enough precision.

Predicting a random variable's value

Due to the nature of a random variable, it is impossible to predict its exact value on any single draw.

Predicting randomness

While we might not be able to predict a specific value, some values are often more likely than others. We can say something about how often a certain number will appear when drawing many examples.

Probability Density (or Distribution) function

How likely each value is for a random variable X is captured by the probability density function pdf(x) in the continuous case and by the probability mass function P(x) in the discrete case.

Probability mass function for discrete random variable

Let's take the example of tossing a fair coin three times, with the random variable X representing the number of heads after the three tosses.

The probability of each head count is:

P(X = 0) = 1/8, P(X = 1) = 3/8, P(X = 2) = 3/8, P(X = 3) = 1/8

Plotting these four values as a bar chart gives the probability mass function of our random variable X.
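A minimal sketch that derives this PMF by enumerating the 8 equally likely outcomes:

from itertools import product

# Enumerate all 2**3 equally likely outcomes of three fair coin tosses
outcomes = list(product("HT", repeat=3))

# P(X = k) = (number of outcomes with k heads) / (total outcomes)
pmf = {k: sum(o.count("H") == k for o in outcomes) / len(outcomes)
       for k in range(4)}
print(pmf)  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}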

Probability density function for continuous random variable

Let's take the example of a random variable Y representing the exact amount of rain tomorrow.

Since it is a continuous random variable, the following question is invalid:

What is the probability that it rains exactly 2 inches tomorrow, i.e. P(Y = 2)?

Due to the continuous nature, we can't ask about an exact value of 2; the probability of any single exact value is zero. The question should be asked over a range, such as 2.00000 to 2.00001, or simply "close to 2".

The probability for a given range is defined as the area under the PDF over that range:

P(a ≤ Y ≤ b) = ∫_{a}^{b} p(y) dy
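As a sketch, assuming purely for illustration that rainfall follows an exponential density with mean 1 inch, the probability of a narrow range near 2 inches can be computed numerically with scipy:

from scipy.integrate import quad
from scipy.stats import expon

# Assumed rainfall density: exponential with mean 1 inch (illustration only)
pdf = expon(scale=1.0).pdf

# P(2.0 <= Y <= 2.1) is the area under the PDF over that range
prob, _ = quad(pdf, 2.0, 2.1)
print(prob)  # a small but nonzero probability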

Probability Theory

Probability theory is the theory of random numbers. We denote such numbers by capital letters to distinguish them from regular numbers written in lowercase.

We can formalize the idea of expressing probabilities of drawing specific values for a random variable with some compact notations.

Probability theory for discrete random variable

The probability mass function for discrete random numbers is as follows:

P(x)=P(X=x)

This describes the probability with which each possible value x of a discrete variable X occurs. Note that x is a regular variable, not a random variable. The value of P(x) predicts the fraction of times we get a value x for the random variable X, if we draw many examples of the random variable.

In the case of the dice-rolling example, we can write the following:

P(1) = P(2) = ⋯ = P(6) = 1/6

The sum of the probabilities of all outcomes equals 1, which is an important normalization condition for any probability mass function:

∑_x P(x) = 1

Probability theory for continuous random variable

For a continuous random variable we have an infinite number of possible values x, so the fraction for each single value becomes formally infinitesimally small. It is therefore necessary to write the probability using a differential notation:

P(x) = p(x) dx

Here p(x) is the probability density function (PDF).

The sum of all probabilities becomes an integral, and the normalization condition for a continuous random variable is:

∫_{−∞}^{∞} p(x) dx = 1

The probability of finding a value in a finite range is then the integral of the density over that range.

Writing a discrete variable in continuous form

We can use the Dirac delta function to write a discrete random process in a continuous form.

For example, here is the discrete probability function for throwing a die:

P(x) = 1/6 for x ∈ {1, 2, 3, 4, 5, 6}

This can be written as a density function as follows:

p(x) = ∑_{i=1}^{6} (1/6) δ(x − i)

Mean of a random variable

The mean is the average value of the random variable when drawing many examples. It is defined as:

μ = E[X] = ∫ x p(x) dx

If we don't know the PDF, we can approximate the mean from measurements:

μ ≈ ∑_i p_i x_i

Here p_i is the frequency of x_i in the given interval.

Median of a random variable

The median is the value of the random variable for which it is equally likely to find a value lower or higher than it:

∫_{−∞}^{median} p(x) dx = 1/2

Variance of a random variable

The spread of the PDF around the mean tells us how dispersed the values are. This spread is often characterized by the standard deviation (STD), or its square, the variance σ²:

σ² = E[(x − μ)²] = ∫ (x − μ)² p(x) dx

Mathematically this is the second moment of the distribution, whereas the mean is the first moment. The nth moment about the mean is defined as:

m_n = ∫ (x − μ)ⁿ p(x) dx

If the PDF is not given, the variance can be estimated from data as follows:

σ² ≈ (1 / (N − 1)) ∑_i (x_i − x̄)²

Here x̄ is the sample mean.
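A quick sketch estimating these quantities from samples; the N(5, 2²) distribution below is an arbitrary choice so that the true moments (mean 5, variance 4) are known:

import numpy as np

# Draw many samples from a distribution with known moments
samples = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=100_000)

print(np.mean(samples))         # ~5.0, the first moment
print(np.median(samples))       # ~5.0, half the mass on each side
print(np.var(samples, ddof=1))  # ~4.0, the unbiased variance estimate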

Types of Probability Density functions

Bernoulli Distribution

A Bernoulli random variable is a variable from an experiment that has two possible outcomes:

  • A success with probability p
  • A failure with probability (1 − p)

The probability function is defined as follows:

P(success) = p
P(failure) = 1 − p

Mean: p

Variance: p(1−p)
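A small simulation to check these moments empirically (p = 0.3 is an arbitrary choice):

import numpy as np

p = 0.3
rng = np.random.default_rng(1)
trials = rng.random(100_000) < p  # each entry is a success with probability p

print(trials.mean())  # ~0.30, matching the mean p
print(trials.var())   # ~0.21, matching the variance p * (1 - p)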

Multinomial Distribution

This is the distribution of outcomes over n trials, where each trial has k possible outcomes and outcome i occurs with probability p_i (with ∑_i p_i = 1).

The probability function for counts n_1, …, n_k (with n_1 + ⋯ + n_k = n) is defined as follows:

P(n_1, …, n_k) = (n! / (n_1! ⋯ n_k!)) p_1^{n_1} ⋯ p_k^{n_k}

mean: n p_i
variance: n p_i (1 − p_i)
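A quick empirical check with numpy (the trial count n = 10 and the probabilities below are arbitrary choices):

import numpy as np

n, p = 10, [0.2, 0.3, 0.5]  # k = 3 outcomes with probabilities p_i
counts = np.random.default_rng(2).multinomial(n, p, size=100_000)

print(counts.mean(axis=0))  # ~[2.0, 3.0, 5.0] = n * p_i
print(counts.var(axis=0))   # ~[1.6, 2.1, 2.5] = n * p_i * (1 - p_i)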

Binomial Distribution

An important special case is the binomial distribution (k = 2), which describes the number of successes in n Bernoulli trials with probability of success p. The binomial coefficient is defined as follows:

C(n, k) = n! / (k! (n − k)!)

The probability function is defined as follows:

P(X = k) = C(n, k) p^k (1 − p)^{n − k}

mean: n p
variance: n p (1 − p)
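A minimal sketch of this PMF using Python's built-in binomial coefficient, with a normalization check (n = 10, p = 0.3 are arbitrary choices):

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k): probability of k successes in n Bernoulli trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The PMF must sum to 1 over all possible k (normalization condition)
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))  # 1.0 (up to rounding)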

Normal Distribution (Also called Gaussian Distribution)

In the limit of a large number of trials, the binomial distribution approaches the normal distribution, which depends on two parameters: the mean μ and the standard deviation σ.

Probability density function:

p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

mean: μ
variance: σ²
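This density transcribes directly into Python:

from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density with mean mu and standard deviation sigma
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

print(normal_pdf(0.0))  # ~0.3989, the peak of the standard normal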

Multivariate normal distribution

A density function over several random variables x_1, …, x_n is called a multivariate density function. An example is the multivariate Gaussian:

p(x) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))

This is a straightforward generalization of the 1-dimensional Gaussian distribution mentioned above, where the mean is now a vector μ and the variance generalizes to a covariance matrix:

Σ = E[(x − μ)(x − μ)ᵀ]

which must be symmetric and positive semidefinite.
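A sketch that samples from a 2-D Gaussian and recovers its parameters (the particular μ and Σ below are arbitrary but valid choices):

import numpy as np

mu = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])  # symmetric and positive definite

samples = np.random.default_rng(3).multivariate_normal(mu, cov, size=100_000)

print(samples.mean(axis=0))           # ~[0.0, 1.0], the mean vector mu
print(np.cov(samples, rowvar=False))  # ~cov, the covariance matrix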

Cumulative distribution function

The probability of having a value x for the random variable X in the range x1 ≤ x ≤ x2 is given by the following equation:

P(x1 ≤ X ≤ x2) = ∫_{x1}^{x2} p(x) dx

Consider calculating the probability that a normally (Gaussian) distributed variable takes a value between x1 = 0 and x2 = y. For the standard normal distribution (mean μ = 0 and variance σ² = 1):

P(0 ≤ X ≤ y) = (1/2) erf(y / √2)

Here erf is the Gaussian error function, which is commonly tabulated in books on statistics for normally distributed variables.

Another important general case is when x1 in the equation above is the lowest possible value of the random variable (usually −∞). The integral then corresponds to the probability that the random variable has a value smaller than a certain value, say y:

CDF(y) = P(X < y) = ∫_{−∞}^{y} p(x) dx

This function of y is called the cumulative distribution function (CDF).
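A small sketch building the standard normal CDF from Python's built-in error function:

from math import erf, sqrt

def standard_normal_cdf(y):
    # P(X < y) for X ~ N(0, 1), written via the Gaussian error function
    return 0.5 * (1 + erf(y / sqrt(2)))

# P(0 <= X <= 1) = CDF(1) - CDF(0)
print(standard_normal_cdf(1.0) - standard_normal_cdf(0.0))  # ~0.3413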

Probability density functions for multiple variables

Let's apply the multivariate concepts to two variables. The total knowledge about the co-occurrence of specific values of two random variables X and Y is captured by their joint density, p(x, y).

The slice of this function at a given value of one variable, say y, normalized by p(y), gives the conditional PDF:

p(x | y) = p(x, y) / p(y)

If we sum (integrate) over all realizations of y, we get the marginal distribution:

p(x) = ∫ p(x, y) dy

The Chain rule

A joint distribution can be decomposed into the product of a conditional and a marginal distribution:

p(x, y) = p(x | y) p(y)

This is easily generalized to n random variables:

p(x_1, …, x_n) = p(x_n | x_1, …, x_{n−1}) ⋯ p(x_2 | x_1) p(x_1)

Independent Random Variables

A random variable X is independent of Y if the following holds:

p(x | y) = p(x)

Using the chain rule, we can then write the joint distribution as:

p(x, y) = p(x) p(y)

This means that the joint distribution of two independent random variables is the product of their marginal distributions.
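A toy illustration: the joint PMF of two independent fair coin flips factorizes into the product of the marginals:

# Marginal PMFs of two independent fair coins
p_x = {"H": 0.5, "T": 0.5}
p_y = {"H": 0.5, "T": 0.5}

# For independent variables, p(x, y) = p(x) * p(y)
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
print(joint)  # every pair has probability 0.25, and the four entries sum to 1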

Conditionally Independent Random Variables

We can also define conditional independence. For example, two random variables X and Y are conditionally independent given a random variable Z if the following holds:

p(x, y | z) = p(x | z) p(y | z)

Bayes' theorem

Consider a classic question: is Steve more likely a librarian or a farmer?

About Steve:

Steve is shy and introverted, with little interest in people or in the world around him. He is hardworking and passionate about finishing his goals.

Question: Is Steve more likely a librarian or a farmer?

Biases in people's decisions: not considering the ratio in the judgment

People are generally biased when making such a prediction: the description sounds librarian-like, so they answer "librarian".

But they don't consider the ratio of librarians to farmers in their judgment. Let's assume the ratio of librarians to farmers is 1:20, say 10 librarians and 200 farmers.

Let's say 40% of librarians and 10% of farmers fit the given description, i.e. 4 librarians and 20 farmers fit the description.

It means the probability that a random person who fits the description is a librarian would be:

P(librarian | description) = 4 / (4 + 20) = 1/6 ≈ 16.7%

Observation: even though the description fit librarians 4 times better than farmers, the probability came out lower. This is because there are far more farmers than librarians.

The heart of Bayes' theorem: new evidence doesn't completely determine your beliefs; it updates your prior beliefs.

When to use Bayes' theorem?

You have some hypothesis (here, that Steve is a librarian) and some evidence (the given description). You want to know the probability that the hypothesis holds given that the evidence is true.

Bayes' theorem formula

An easy way to remember the formula is to deduce it from a population diagram, as in the Steve example above. With H the hypothesis and E the evidence:

P(H | E) = P(H) P(E | H) / P(E)

Here P(H) is the prior, P(E | H) is the likelihood, and P(E) is the total probability of the evidence.
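The Steve example computed with this formula (the counts follow the 1:20 assumption above):

# H = "Steve is a librarian", E = the given description
p_h = 10 / 210          # prior: 10 librarians among 210 people
p_e_given_h = 0.40      # 40% of librarians fit the description
p_e_given_not_h = 0.10  # 10% of farmers fit the description

# Total probability of the evidence: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

print(p_e_given_h * p_h / p_e)  # ~0.167, i.e. 4 / (4 + 20)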

Bayes' rule

This theorem is important because it tells us how to combine prior knowledge about a random variable we want to estimate, p(x), with the likelihood p(y | x) of data y given x:

p(x | y) = p(y | x) p(x) / p(y)

Markov chain Monte Carlo (MCMC) method

MCMC is a method to draw samples from a posterior distribution without the need to evaluate the denominator p(y), also called the partition function, which is usually intractable. This works because the posterior is needed only up to proportionality:

p(x | y) ∝ p(y | x) p(x)
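A minimal Metropolis sampler sketch illustrating this idea; the unnormalized target below is a stand-in for p(y | x) p(x), chosen purely for illustration:

import random
from math import exp

def unnormalized_posterior(x):
    # Stand-in for p(y | x) * p(x); the partition function is never computed
    return exp(-0.5 * (x - 3.0) ** 2)

x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + random.gauss(0.0, 1.0)  # symmetric random-walk proposal
    # Accept with probability min(1, ratio); the normalizer cancels in the ratio
    if random.random() < unnormalized_posterior(proposal) / unnormalized_posterior(x):
        x = proposal
    samples.append(x)

print(sum(samples) / len(samples))  # ~3.0, the mean of the target distribution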
