STATISTICS AND PROBABILITY

Rojinvarghese
5 min read · Nov 30, 2020

ANALYZING CATEGORICAL VARIABLES:

A categorical variable is one that has two or more categories, but there is no intrinsic ordering to the categories.

Analyze one categorical variable:

In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.
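A single categorical variable is usually analyzed by counting how often each category occurs. The sketch below is only an illustration: the `colours` list is made-up survey data, not something from a real study.

```python
from collections import Counter

# Hypothetical sample: favourite colour reported by ten survey respondents.
colours = ["red", "blue", "red", "green", "blue",
           "red", "blue", "red", "green", "red"]

counts = Counter(colours)                 # frequency of each category
total = sum(counts.values())
proportions = {c: n / total for c, n in counts.items()}  # relative frequency

# Print the categories from most to least frequent.
for category, n in counts.most_common():
    print(f"{category:<6} count={n}  proportion={proportions[category]:.2f}")
```

The counts are what a bar chart would display, and the proportions are what a pie chart would display.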

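Analyze two categorical variables:

The relationship between two categorical variables is usually summarized in a two-way (contingency) table and tested with a chi-square test. A one-way analysis of variance (ANOVA) applies to a different situation: it is used when you have a categorical independent variable (with two or more categories) and a normally distributed interval dependent variable, and you wish to test for differences in the means of the dependent variable broken down by the levels of the independent variable.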

Chi-Square test:

The chi-squared (χ²) test is a test for nominal (categorical) data. It is used to determine whether an association (or relationship) between two categorical variables in a sample is likely to reflect a real association between these two variables in the population.
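As a rough sketch of how the chi-square statistic is computed, the example below compares the observed counts in a two-way table with the counts expected if the two variables were unrelated. The table values and the helper name `chi_square_statistic` are assumptions made for illustration.

```python
# Hypothetical 2x2 table of observed counts:
# rows = group (A, B), columns = outcome (yes, no).
observed = [[30, 10],
            [20, 40]]

def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

chi2, dof = chi_square_statistic(observed)
print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}")
```

For this made-up table the statistic is about 16.67 with 1 degree of freedom, well above the 5% critical value of 3.84, so the sample would suggest an association between group and outcome.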

DISPLAYING AND COMPARING QUANTITATIVE DATA:

A bar chart or pie chart is often used to display categorical data. These types of displays, however, are not appropriate for quantitative data. Quantitative data is often displayed using a histogram, a dot plot or a stem and leaf plot.

In a histogram, the interval corresponding to the width of each bar is called a bin. A histogram displays the bin counts as the height of the bars (like a bar chart). Unlike a bar chart, however, the bars in a histogram touch one another. An empty space between bars represents a gap in data values. If a value falls on the border between two consecutive bars, it is placed in the bin on the right.

A relative frequency histogram displays the proportion of cases in each bin instead of the count.
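The binning rule described above can be sketched in a few lines. The data values, the bin width of 5 and the helper name `bin_counts` are assumptions chosen for illustration; the point is that a value on a border (such as 10) is counted in the bin on its right.

```python
# Hypothetical data values and bin width, chosen for illustration.
values = [2, 5, 7, 10, 10, 12, 15, 18, 20, 24]
bin_width = 5
start = 0

def bin_counts(data, start, width, n_bins):
    """Count values per bin. Each bin is [left, right), so a value that
    falls on a border is placed in the bin on its right."""
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

counts = bin_counts(values, start, bin_width, n_bins=5)
relative = [c / len(values) for c in counts]   # relative frequency histogram

# Text version of the histogram: one row per bin.
for i, (c, r) in enumerate(zip(counts, relative)):
    left = start + i * bin_width
    print(f"[{left:>2}, {left + bin_width:>2})  {'#' * c:<4} {r:.1f}")
```

The `counts` list gives the bar heights of an ordinary histogram, and `relative` gives the bar heights of the relative frequency histogram.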

Histograms are useful when working with large sets of data, and they can easily be constructed using a graphing calculator. A disadvantage of histograms is that they do not show individual values.

A stem and leaf plot is similar to a histogram, but it shows individual values rather than bars. It may be necessary to split stems if the range of data values is small. A back-to-back stem and leaf plot can be useful when comparing two distributions.
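A stem and leaf plot is simple enough to build by hand, and the sketch below shows one possible way to do it in code. The scores are made up, and the helper `stem_and_leaf` assumes two-digit values, with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# Hypothetical two-digit test scores.
scores = [52, 67, 61, 75, 78, 73, 70, 84, 89, 91]

def stem_and_leaf(data):
    """Group each value by its tens digit (stem) and list its units digit (leaf)."""
    plot = defaultdict(list)
    for x in sorted(data):
        plot[x // 10].append(x % 10)
    return dict(plot)

plot = stem_and_leaf(scores)

# Print every stem in the range, including stems that have no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Unlike a histogram, every individual score can be read back from the output.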

SUMMARIZING QUANTITATIVE DATA:

Summary statistics summarize and provide information about your sample data. They tell you something about the values in your data set, including where the average lies and whether your data is skewed.
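A minimal sketch using Python's built-in `statistics` module is shown below. The sample is made up and deliberately contains one large value, so that the mean is pulled above the median, which is a common sign of right skew.

```python
import statistics

# Hypothetical sample with one large value that pulls the mean upward.
data = [2, 3, 3, 4, 5, 5, 6, 20]

mean = statistics.mean(data)
median = statistics.median(data)
stdev = statistics.stdev(data)                # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"quartiles: {q1}, {q2}, {q3}")

# A mean well above the median hints at a right-skewed distribution.
if mean > median:
    print("mean > median: the data may be skewed to the right")
```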

MODELLING DATA DISTRIBUTION:

BIVARIATE NUMERICAL DATA:

In statistics, bivariate data is data on each of two variables, where each value of one variable is paired with a value of the other variable. Typically it is of interest to investigate the possible association between the two variables. The association can be studied via a tabular or graphical display, or via sample statistics which might be used for inference.

BIVARIATE ANALYSIS:

Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis, used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y.
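One common sample statistic for the relationship between X and Y is the Pearson correlation coefficient. The sketch below computes it from first principles; the paired values and the helper name `pearson_r` are assumptions for illustration.

```python
import math

# Hypothetical paired observations (x_i, y_i),
# for example hours studied and quiz score.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.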

PROBABILITY:

Probability is simply how likely something is to happen. Whenever we are unsure about the outcome of an event, we can talk about the probabilities of certain outcomes, that is, how likely they are. The analysis of events governed by probability is called statistics.
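When all outcomes are equally likely, a probability is the number of favourable outcomes divided by the total number of outcomes. As an illustration (the two-dice example is an assumption, not taken from the text above), the sketch below finds the probability that two fair dice sum to 7.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes in which the two dice add up to 7.
favourable = [o for o in outcomes if sum(o) == 7]

p_seven = Fraction(len(favourable), len(outcomes))
print(f"P(sum is 7) = {p_seven} = {float(p_seven):.3f}")
```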

COUNTING, PERMUTATION AND COMBINATION:

The fundamental counting principle states that if one event has m possible outcomes and a second event has n possible outcomes, then there are m × n total possible outcomes for the two events together. A permutation is an arrangement of r objects chosen from a total of n objects, where order matters. A combination is a selection of r objects from a total of n objects, where order does not matter.
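These counting rules map directly onto functions in Python's `math` module (available from Python 3.8). The numbers used below, such as 3 shirts, 4 trousers and choosing 2 from 5, are made-up values for illustration.

```python
import math

# Fundamental counting principle: m outcomes followed by n outcomes.
shirts, trousers = 3, 4
outfits = shirts * trousers            # 3 x 4 = 12 possible outfits

# Permutations: ordered arrangements of 2 objects chosen from 5.
ordered = math.perm(5, 2)              # 5 x 4 = 20

# Combinations: unordered selections of 2 objects chosen from 5.
unordered = math.comb(5, 2)            # 20 / 2! = 10

print(outfits, ordered, unordered)     # prints: 12 20 10
```

Every unordered pair can be arranged in 2 ways, which is why there are twice as many permutations as combinations here.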

RANDOM VARIABLE:

A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: discrete and continuous.

A discrete random variable takes specific, countable values.

A continuous random variable can take any value in a continuous range.
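The difference between the two types can be sketched as follows. The discrete example (number of heads in two fair coin flips) and the continuous example (a uniform draw between 0 and 1) are assumptions chosen for illustration.

```python
import random
from fractions import Fraction

# Discrete: X = number of heads in two fair coin flips.
# X can only be 0, 1 or 2, each with a known probability.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

expected_value = sum(x * p for x, p in pmf.items())
variance = sum((x - expected_value) ** 2 * p for x, p in pmf.items())
print(f"E[X] = {expected_value}, Var(X) = {variance}")

# Continuous: Y can take any value in the range from 0 to 1.
y = random.uniform(0.0, 1.0)
print(f"one draw of Y: {y:.4f}")
```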

SAMPLING DISTRIBUTION:

A sampling distribution is the probability distribution of a statistic obtained from a large number of samples drawn from a specific population. It describes the frequencies of the range of different values that the statistic could take across those samples.
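A sampling distribution can be approximated by simulation: draw many samples, compute the statistic for each, and look at how those values are distributed. The sketch below does this for the sample mean. The population (uniform values between 0 and 100), the sample size of 30 and the 1,000 repetitions are all arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values.
population = [random.uniform(0, 100) for _ in range(10_000)]

sample_size = 30
n_samples = 1_000

# The statistic of interest here is the sample mean.
sample_means = [
    statistics.mean(random.sample(population, sample_size))
    for _ in range(n_samples)
]

pop_mean = statistics.mean(population)
mean_of_means = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)   # estimated standard error

print(f"population mean      = {pop_mean:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
print(f"spread of the means  = {spread:.2f}")
```

The sample means cluster around the population mean, and their spread is much smaller than the spread of the individual population values.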

CONFIDENCE INTERVAL:

A confidence interval is a range of values that is likely to include a population value with a certain degree of confidence. The confidence level is often expressed as a percentage, for example a 95% confidence interval, whereby the population mean is estimated to lie between a lower and an upper bound.
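A rough sketch of a 95% confidence interval for a population mean is shown below. The sample values are made up, and the interval uses the normal (z) critical value as a simplifying assumption; for a small sample like this one, a t critical value would normally be more appropriate.

```python
import math
import statistics

# Hypothetical sample of ten measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Critical value for 95% confidence under the normal approximation (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower = mean - z * standard_error
upper = mean + z * standard_error
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

A higher confidence level, such as 99%, uses a larger critical value and so gives a wider interval.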

SIGNIFICANCE TEST (HYPOTHESIS TESTING):

A test of significance is a formal procedure for comparing observed data with a claim (also called a hypothesis), the truth of which is being assessed. The claim is a statement about a parameter, like the population proportion p or the population mean μ.
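As an illustration, the sketch below tests a claim about a population proportion p using a one-proportion z-test. The scenario (60 heads in 100 coin flips, with the claim that the coin is fair, p = 0.5) is a made-up example, and the test relies on the normal approximation.

```python
import math
from statistics import NormalDist

# Hypothetical data: 60 heads observed in 100 flips.
n = 100
heads = 60
p0 = 0.5                      # claimed value of p (the null hypothesis)

p_hat = heads / n             # observed sample proportion
standard_error = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error

# Two-sided p-value: chance of a result at least this extreme if the claim is true.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

Here z is 2 and the p-value is about 0.046, so at the 5% significance level the data would count as evidence against the claim that the coin is fair.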

INFERENCE FOR CATEGORICAL DATA:

A chi-square test of independence determines whether there is an association between two categorical variables. It is a nonparametric test. This test is also known as the chi-square test of association.

ANALYSIS OF VARIANCE (ANOVA):

The one-way analysis of variance is used to determine whether there are any statistically significant differences between the means of two or more independent (unrelated) groups (although you tend to only see it used when there are a minimum of three, rather than two, groups).

Analysis of variance (ANOVA) is a widely used analytic tool in statistics. It splits the aggregate variability observed in a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the data set, while the random factors do not. The analyst uses ANOVA to determine the influence that the independent variable has on the dependent variable. With ANOVA, we test the differences between two or more means, which is why some statisticians suggest it could be called “Analysis of Means.” It tests for an overall difference among the means rather than identifying which specific means differ. With the help of this tool, researchers can compare several groups in a single test instead of running many separate pairwise tests.

The Formula for ANOVA

F = MST / MSE

where:

F=ANOVA coefficient (the F statistic)

MST=Mean sum of squares due to treatment

MSE=Mean sum of squares due to error
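The F statistic is the ratio of the between-group mean square (MST) to the within-group mean square (MSE). The sketch below computes it from first principles for three small made-up groups; the helper name `one_way_anova_f` and the data are assumptions for illustration.

```python
# Hypothetical measurements for three independent groups.
groups = [
    [4, 5, 6],
    [6, 7, 8],
    [8, 9, 10],
]

def one_way_anova_f(groups):
    """Return (F, MST, MSE) for a one-way ANOVA."""
    k = len(groups)                                  # number of groups
    n_total = sum(len(g) for g in groups)            # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability between the group means (due to treatment).
    ss_treatment = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability inside each group (due to error).
    ss_error = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    mst = ss_treatment / (k - 1)
    mse = ss_error / (n_total - k)
    return mst / mse, mst, mse

f_stat, mst, mse = one_way_anova_f(groups)
print(f"MST = {mst}, MSE = {mse}, F = {f_stat}")   # MST = 12.0, MSE = 1.0, F = 12.0
```

A large F means the group means differ by more than the variation within the groups would explain. The value is then compared with an F distribution with k - 1 and N - k degrees of freedom, here 2 and 6.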

Types of ANOVA

There are two main types of ANOVA.

One-way ANOVA

A one-way ANOVA involves a single factor (one independent variable) and a single response variable. It evaluates the impact of that one factor by testing whether the samples come from populations with the same mean. In other words, it is used to determine whether there is any statistically significant difference between the means of two or more (typically three or more) independent groups.

Two-way ANOVA

A two-way ANOVA is the extended version of the one-way ANOVA. In a two-way ANOVA, there are two independent variables (factors). It tests the effect of each factor, and of the interaction between the two factors, at the same time. This statistical test is used to determine the effect of two nominal predictor variables on a continuous outcome variable.
