


The output of the function is visualized by drawing the bell-shaped curve defined by the input to the function. If the cumulative flag is set to TRUE, the return value is equal to the area to the left of the input. If the cumulative flag is set to FALSE, the return value is equal to the value on the curve.

The normal PDF is a bell-shaped probability density function described by two values: the mean and the standard deviation. The mean represents the center or "balancing point" of the distribution. The standard deviation represents how spread out the distribution is around the mean. The area under the normal distribution is always equal to 1, and the share of that area that lies within a given number of standard deviations of the mean is always the same, as shown in the figure below. For example, 68.3% of the area will always lie within one standard deviation of the mean.

Probability density functions model problems over continuous ranges. The area under the function over a range represents the probability of an event occurring in that range. For example, a student is very unlikely to score exactly 93.41% on a test. Instead, it is reasonable to compute the probability of the student scoring between 90% and 95% on the test. Assuming that the test scores are normally distributed, the probability can be calculated using the output of the cumulative distribution function as shown in the formula below.

How can you generate data that contains outliers in a simulation study? The contaminated normal distribution is a simple but useful distribution you can use to simulate outliers. The distribution is easy to explain and understand, and it is also easy to implement in SAS.

What is a contaminated normal distribution?

The contaminated normal distribution was originally studied by John Tukey in the 1940s and '50s. As I say in my book Simulating Data with SAS (2013, p. 119), "the contaminated normal distribution is a specific instance of a two-component mixture distribution in which both components are normally distributed with a common mean. This results in a distribution with heavier tails than normality. [It] is also a convenient way to generate data with outliers."

A contaminated normal distribution is a mixture of two normal distributions with mixing probabilities (1 - α) and α, where 0 < α < 1 is typically small and λ > 1 is a parameter that determines the standard deviation of the wider component. In terms of the normal density φ, the density of the mixture is (1 - α) φ(x; μ, σ) + α φ(x; μ, λσ). The idea is that the "main" distribution (φ(x; μ, σ)) is slightly "contaminated" by a wider distribution.

This article uses α = 0.1, which represents 10% "contamination." Tukey (1960) uses λ=3 as a scale multiplier. Tukey reports that when λ=3 and α=0.1, "the two constituents contribute equal amounts to the variance of the contaminated distribution." In the following sections, μ=0 and σ=1, so that the uncontaminated component is the standard normal distribution.

The density of the contaminated normal distribution

The following SAS DATA step constructs the density of a contaminated normal distribution as the linear combination of a N(0,1) and a N(0,3) density. The call to the SGPLOT procedure plots the density and the component densities:

%let alpha  = 0.1;   /* level of contamination */
%let lambda = 3;     /* magnitude of contamination */
data CN;             /* data set name assumed; not recoverable from the source */
do x = -10 to 10 by 0.1;                   /* grid of x values (range assumed) */
   Y1 = pdf("Normal", x, 0, 1);            /* std normal component */
   Y2 = pdf("Normal", x, 0, 1*&lambda);    /* contamination */
   CN = (1-&alpha)*Y1 + &alpha*Y2;         /* contaminated normal */
   output;
end;
run;

title "Contaminated Normal Distribution";
title2 "alpha = &alpha lambda = &lambda";
proc sgplot data=CN;
   series x=x y=Y1;                        /* component densities (statements assumed) */
   series x=x y=Y2;
   series x=x y=CN / lineattrs=(thickness=3);
run;

As shown in the graph, the contaminated normal distribution (shown with a thick line) has heavier tails than the "uncontaminated" normal component.

Random samples from the contaminated normal distribution

The book Simulating Data with SAS (2013, p. 120) provides an algorithm for simulating data from a contaminated normal distribution. The algorithm is a special case of simulating from a mixture distribution. Choose the contaminating component with probability α (and the main component otherwise), and then generate a value from whichever component is chosen, as follows:

%let N = 100;        /* size of sample */
data CNRand(keep=x contaminate);
do i = 1 to &N;                            /* loop assumed; not recoverable from the source */
   contaminate = rand("Bernoulli", &alpha);
   if contaminate then x = rand("Normal", 0, &lambda);
   else x = rand("Normal", 0, 1);          /* uncontaminated branch (assumed) */
   output;
end;
run;

proc sgplot data=CNRand;
   histogram x;                            /* statement assumed from the text below */
   density x / type=normal(mu=0 sigma=1) name="normal";
   density x / type=kernel name="kernel";  /* assumed; the legend refers to "kernel" */
   fringe x / group=contaminate lineattrs=(thickness=2);
   keylegend "kernel" "normal" / location=inside position=topright across=1;
run;

The histogram shows the distribution of the simulated sample.
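Tukey's remark that the two constituents contribute equal amounts to the variance can be checked with a little arithmetic. Because both components have mean 0, the variance of the mixture is (1 - α)·1 + α·λ², which for α = 0.1 and λ = 3 is 0.9 + 0.9 = 1.8. The following is a minimal sketch of that check, not code from the original article. The data set name VarCheck, the macro variable NBig, and the seed are hypothetical choices made for illustration.

```sas
%let alpha  = 0.1;      /* level of contamination */
%let lambda = 3;        /* magnitude of contamination */
%let NBig   = 100000;   /* hypothetical large sample size for this check */

/* Theoretical variance contributions of the two components */
data _null_;
   v1 = (1 - &alpha) * 1;          /* contribution of the N(0,1) component */
   v2 = &alpha * &lambda**2;       /* contribution of the N(0,lambda) component */
   total = v1 + v2;                /* variance of the mixture (common mean 0) */
   put v1= v2= total=;
run;

/* Simulate a large contaminated normal sample (same mixture algorithm) */
data VarCheck(keep=x);
   call streaminit(12345);         /* arbitrary seed, assumed for reproducibility */
   do i = 1 to &NBig;
      if rand("Bernoulli", &alpha) then x = rand("Normal", 0, &lambda);
      else x = rand("Normal", 0, 1);
      output;
   end;
run;

/* The sample variance should be close to the theoretical total */
proc means data=VarCheck n mean var;
   var x;
run;
```

With a sample this large, the sample variance reported by PROC MEANS should land near the theoretical value, although the exact figure depends on the seed.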
