Statistical data type

From Wikipedia, the free encyclopedia

In statistics, groups of individual data points may be classified as belonging to any of various statistical data types, e.g. categorical ("red", "blue", "green"), real number (1.68, −5, 1.7×10^6), odd number (1, 3, 5), etc. The data type is a fundamental component of the semantic content of the variable, and controls which sorts of probability distributions can logically be used to describe the variable, the permissible operations on the variable, the type of regression analysis used to predict the variable, etc. The concept of data type is similar to the concept of level of measurement, but more specific. For example, count data require a different distribution (e.g. a Poisson distribution or binomial distribution) than non-negative real-valued data require, but both fall under the same level of measurement (a ratio scale).

Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements have no meaningful rank order among values, and permit any one-to-one transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements, but the zero value is arbitrary (as is the case with longitude and with temperature measured in degrees Celsius or degrees Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and meaningful distances between measurements, and permit any rescaling transformation.
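The difference between interval and ratio scales can be illustrated with a short sketch. The temperature and length values below are hypothetical, chosen only for illustration. A change of units on an interval scale (Celsius to Fahrenheit) alters ratios of values but keeps ratios of differences, whereas a rescaling on a ratio scale (metres to centimetres) keeps ratios of values.

```python
def c_to_f(celsius: float) -> float:
    """Linear change of units, permitted on an interval scale."""
    return celsius * 9 / 5 + 32


# Hypothetical temperatures in degrees Celsius
temps_c = [10.0, 20.0, 30.0]
temps_f = [c_to_f(t) for t in temps_c]

# Ratios of values are NOT preserved, because the zero point is arbitrary
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

# Ratios of differences ARE preserved under the change of units
diff_ratio_c = (temps_c[2] - temps_c[1]) / (temps_c[1] - temps_c[0])
diff_ratio_f = (temps_f[2] - temps_f[1]) / (temps_f[1] - temps_f[0])

# On a ratio scale a pure rescaling (metres to centimetres) keeps value ratios
lengths_m = [2.0, 4.0]
lengths_cm = [m * 100 for m in lengths_m]
length_ratio_m = lengths_m[1] / lengths_m[0]
length_ratio_cm = lengths_cm[1] / lengths_cm[0]

print(ratio_c, ratio_f)                 # differ between the two units
print(diff_ratio_c, diff_ratio_f)       # agree
print(length_ratio_m, length_ratio_cm)  # agree
```

A statement such as "twice as warm" therefore depends on the unit chosen, while "twice as long" does not.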

Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
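The loose correspondence with computer science data types can be sketched as follows. This is a minimal illustration, not a standard encoding: the function name `encode`, the field names, and the integer codes for blood types are all assumptions made for this example.

```python
# Arbitrarily assigned integer codes for a polytomous categorical variable.
# The numbers are labels only; their order and spacing carry no meaning.
BLOOD_TYPE_CODES = {"A": 0, "B": 1, "AB": 2, "O": 3}


def encode(smoker: str, blood_type: str, height_m: str) -> tuple:
    """Encode one hypothetical survey record.

    dichotomous categorical -> bool
    polytomous categorical  -> int (arbitrary code)
    continuous              -> float
    """
    return (smoker == "yes", BLOOD_TYPE_CODES[blood_type], float(height_m))


encoded = encode("no", "AB", "1.68")
print(encoded)

# Decoding recovers the label, which is the only meaningful use of the code
label_for_code = {code: label for label, code in BLOOD_TYPE_CODES.items()}
print(label_for_code[encoded[1]])
```

Arithmetic on the integer codes (for example averaging blood type codes) would run without error but has no statistical meaning, which is why the statistical data type has to be tracked separately from the storage type.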

Other categorizations have been proposed. For example, Mosteller and Tukey (1977)[1] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990)[2] described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998)[3] and van den Berg (1991).[4]

The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer" (Hand, 2004, p. 82).[5]
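Hand's point about truth values that are not invariant under transformation can be made concrete with a small sketch. The two groups of ordinal scores and the recoding below are hypothetical. The recoding is order-preserving, so it is a permissible transformation for ordinal data, yet it reverses a comparison of means while leaving a comparison of medians unchanged.

```python
from statistics import mean, median

# Hypothetical ordinal scores for two groups
group_a = [1, 1, 5]
group_b = [2, 2, 2]

# Order-preserving recoding: 1 < 2 < 5 becomes 1 < 2 < 3
recode = {1: 1, 2: 2, 5: 3}
group_a_recoded = [recode[x] for x in group_a]
group_b_recoded = [recode[x] for x in group_b]

# "Group A has the higher mean" changes truth value under the recoding
mean_claim_before = mean(group_a) > mean(group_b)
mean_claim_after = mean(group_a_recoded) > mean(group_b_recoded)

# "Group A has the lower median" keeps its truth value
median_claim_before = median(group_a) < median(group_b)
median_claim_after = median(group_a_recoded) < median(group_b_recoded)

print(mean_claim_before, mean_claim_after)      # True False
print(median_claim_before, median_claim_after)  # True True
```

Whether the mean comparison is still worth making then depends, as the quotation says, on the question being asked.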

Transcription

Types of data: Nominal, Ordinal, Interval/Ratio

Data is central to statistical analysis. When we wish to find out more about a phenomenon or process, we collect data. Usually we collect several measures on each person or thing of interest. Each thing we collect data about is called an observation. If we are interested in how people respond, then each observation will be a person. Or an observation could be a business or a product, or a period in time, such as a week.

Variables record the measurements we are interested in. Age, sex and chocolate preference can all be stored as variables. For each observation we record a score or value for each of the variables. When we store this data in a spreadsheet or database, each row corresponds to a single observation and each column is a variable.

Level of measurement

The level of measurement used for a variable determines which summary statistics, graphs and analyses are possible and sensible.

The nominal level is the most basic level of measurement. Nominal is also known as categorical or qualitative. Examples of nominal variables are sex, preferred type of chocolate and colour. These are descriptions or labels with no sense of order. Nominal values can be stored as a word or text or given a numerical code. However, the numbers do not imply order. To summarise nominal data we use a frequency or percentage. You cannot calculate a mean or average value for nominal data.

The next level of measurement is ordinal. Examples of ordinal variables are rank, satisfaction, and fanciness! Ordinal variables have a meaningful order, but the intervals between the values in the scale may not be equal. For example, the gap between first and second runners in a race may be small, whereas there is a bigger gap between second and third. Similarly, there may be a big difference between satisfied and unsatisfied, but a smaller difference between unsatisfied and very unsatisfied. Like nominal data, ordinal data can be given as frequencies. Some people state that you should never calculate a mean or average for ordinal data. However, it is quite common practice, particularly in research regarding people's behaviour, to find mean values for ordinal data. If you do this, you should be careful to think about what it means and whether it is justifiable.

The most precise level of measurement is interval/ratio. This label includes things that can be measured rather than classified or ordered, such as number of customers, weight, age and size. Interval/ratio data is also known as scale, quantitative or parametric. Interval/ratio data can be discrete, with whole numbers, or continuous, with fractional numbers. Interval/ratio data is very mathematically versatile. The most common summary measures are the mean, the median and the standard deviation.

The way data should be represented in a graph or chart depends on the level of measurement. Nominal data can be displayed as a pie chart, column or bar chart, or stacked column or bar chart. In most cases the best choice for a single set of nominal data is a column chart. Ordinal data must not be represented as a pie chart, but is best shown as a column or bar chart. Interval/ratio data is best represented as a bar chart or a histogram. For these the data is grouped. Box plots illustrate the summary statistics for a variable in a neat way. Data which occurs over time is best displayed as a line chart.

Here is an example using different types of data. Helen sells choconutties. Helen is interested in developing a new product to add to her line of choconutties. She develops a questionnaire and asks a random sample of 50 of her customers to fill it out. She asks them their age and sex, how much they spend on groceries each week, how many chocolate bars they buy in a week, and which they like best out of dark, milk and white chocolate. She asks them how satisfied they are with choconutties: very satisfied, satisfied, not satisfied, very unsatisfied. And she asks them how likely they are to buy a whole box of 10 packets of choconutties.

Helen enters the data in a spreadsheet. Each row has responses from one customer. Each column contains the measurements or scores for one variable.

The type of chocolate preferred is nominal data. This can be shown in a pie chart or bar chart. We can summarise by saying that 46% of customers prefer dark chocolate, 40% prefer milk chocolate, and 14% prefer white chocolate.

The measures of satisfaction and likelihood are ordinal level data. These should not be shown in a pie chart. The values should be put in a logical order in a column chart. We could say that 32% are very satisfied with choconutties and 72% of people are satisfied or very satisfied. The average satisfaction score comes to 2.06, which could be interpreted as satisfied. However, it is debatable whether it is sensible to calculate a mean satisfaction score.

Age, amount spent on groceries and number of chocolate bars are all interval/ratio data. These can be displayed on bar charts or histograms. We can say that for the customers in the sample, the mean age is 38 years, the mean amount spent on groceries is $192, and the mean number of chocolate bars bought per week is 3.3. These are all meaningful summary statistics.

The type of analysis that is sensible for a given dataset depends on the level of measurement. You can find out more about this in the video, "Choosing the test".
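The nominal summary in Helen's example can be reproduced with a short sketch. The transcript gives only the percentages and the sample size of 50, so the counts used here (23, 20 and 7) are back-calculated from those figures rather than taken from any actual data.

```python
from collections import Counter

# Counts implied by 46%, 40% and 14% of a sample of 50 customers
responses = ["dark"] * 23 + ["milk"] * 20 + ["white"] * 7

# Frequencies and percentages are the appropriate summary for nominal data
counts = Counter(responses)
percentages = {label: 100 * n / len(responses) for label, n in counts.items()}

# The mode is the most frequent category
most_common_label = counts.most_common(1)[0][0]

for label, pct in percentages.items():
    print(f"{label}: {counts[label]} customers ({pct:.0f}%)")
print("mode:", most_common_label)
```

No mean is computed here, because the categories have no numerical value or order.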

Simple data types

The following table classifies the various simple data types, associated distributions, permissible operations, etc. Regardless of the logical possible values, all of these data types are generally coded using real numbers, because the theory of random variables often explicitly assumes that they hold real numbers.

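Each entry below lists, for one data type: possible values, example usage, level of measurement, distribution, scale of relative differences, permissible statistics, and regression analysis.

  • Binary
    Possible values: 0, 1 (arbitrary labels)
    Example usage: binary outcome ("yes/no", "true/false", "success/failure", etc.)
    Level of measurement: nominal scale
    Distribution: Bernoulli
    Scale of relative differences: incomparable
    Permissible statistics: mode, chi-squared
    Regression analysis: logistic, probit

  • Categorical
    Possible values: "name1", "name2", "name3", ... "nameK" (arbitrary labels)
    Example usage: categorical outcome with names or places like "Rome", "Amsterdam", "Madrid", "London", "Washington" (specific blood type, political party, word, etc.)
    Level of measurement: nominal scale
    Distribution: categorical
    Scale of relative differences: incomparable
    Permissible statistics: mode, chi-squared
    Regression analysis: multinomial logit, multinomial probit

  • Ordinal
    Possible values: ordering categories, or integer or real number (arbitrary scale)
    Example usage: ordering adverbs like "Small", "Medium", "Large"; relative score, significant only for creating a ranking
    Level of measurement: ordinal scale
    Distribution: categorical
    Scale of relative differences: relative comparison
    Regression analysis: ordinal regression (ordered logit, ordered probit)

  • Binomial
    Possible values: 0, 1, ..., N
    Example usage: number of successes (e.g. yes votes) out of N possible
    Level of measurement: interval scale
    Distribution: binomial, beta-binomial, etc.
    Scale of relative differences: additive
    Permissible statistics: mean, median, mode, standard deviation, correlation
    Regression analysis: binomial regression (logistic, probit)

  • Count
    Possible values: nonnegative integers (0, 1, ...)
    Example usage: number of items (telephone calls, people, molecules, births, deaths, etc.) in a given interval/area/volume
    Level of measurement: ratio scale
    Distribution: Poisson, negative binomial, etc.
    Scale of relative differences: multiplicative
    Permissible statistics: all statistics permitted for interval scales, plus geometric mean, harmonic mean, coefficient of variation
    Regression analysis: Poisson, negative binomial regression

  • Real-valued additive
    Possible values: real number
    Example usage: temperature in degrees Celsius or degrees Fahrenheit, relative distance, location parameter, etc. (or approximately, anything not varying over a large scale)
    Level of measurement: interval scale
    Distribution: normal, etc. (usually symmetric about the mean)
    Scale of relative differences: additive
    Permissible statistics: mean, median, mode, standard deviation, correlation
    Regression analysis: standard linear regression

  • Real-valued multiplicative
    Possible values: positive real number
    Example usage: temperature in kelvin, price, income, size, scale parameter, etc. (especially when varying over a large scale)
    Level of measurement: ratio scale
    Distribution: log-normal, gamma, exponential, etc. (usually a skewed distribution)
    Scale of relative differences: multiplicative
    Permissible statistics: all statistics permitted for interval scales, plus geometric mean, harmonic mean, coefficient of variation
    Regression analysis: generalized linear model with logarithmic link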
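The additional statistics listed for multiplicative (ratio-scale) data can be computed with Python's standard `statistics` module, as in the following sketch. The income figures are hypothetical and deliberately span a large scale. The last step checks that the coefficient of variation is unaffected by a rescaling, the transformation that ratio scales permit.

```python
from statistics import geometric_mean, harmonic_mean, mean, stdev

# Hypothetical positive, ratio-scale values varying over a large scale
incomes = [1.0, 10.0, 100.0]

gm = geometric_mean(incomes)          # central value on a multiplicative scale
hm = harmonic_mean(incomes)           # reciprocal of the mean of reciprocals
cv = stdev(incomes) / mean(incomes)   # coefficient of variation (unitless)

# Rescaling (e.g. changing currency units) leaves the coefficient of
# variation unchanged, up to floating-point rounding
rescaled = [x * 1000 for x in incomes]
cv_rescaled = stdev(rescaled) / mean(rescaled)

print(f"geometric mean: {gm:.4f}")
print(f"harmonic mean:  {hm:.4f}")
print(f"coefficient of variation: {cv:.4f} (rescaled: {cv_rescaled:.4f})")
```

These quantities presuppose a meaningful zero, which is why they are not listed for the additive (interval-scale) types.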

Multivariate data types

Data that cannot be described using a single number are often shoehorned into random vectors of real-valued random variables, although there is an increasing tendency to treat them on their own. Some examples:

  • Random vectors. The individual elements may or may not be correlated. Examples of distributions used to describe correlated random vectors are the multivariate normal distribution and multivariate t-distribution. In general, there may be arbitrary correlations between any elements and any others; however, this often becomes unmanageable above a certain size, requiring further restrictions on the correlated elements.
  • Random matrices. Random matrices can be laid out linearly and treated as random vectors; however, this may not be an efficient way of representing the correlations between different elements. Some probability distributions are specifically designed for random matrices, e.g. the matrix normal distribution and Wishart distribution.
  • Random sequences. These are sometimes considered to be the same as random vectors, but in other cases the term is applied specifically to cases where each random variable is only correlated with nearby variables (as in a Markov model). This is a particular case of a Bayes network and often used for very long sequences, e.g. gene sequences or lengthy text documents. A number of models are specifically designed for such sequences, e.g. hidden Markov models.
  • Random processes. These are similar to random sequences, but where the length of the sequence is indefinite or infinite and the elements in the sequence are processed one-by-one. This is often used for data that can be described as a time series, e.g. the price of a stock on successive days. Random processes are also used to model values that vary continuously (e.g. the temperature at successive moments in time), rather than at discrete intervals.
  • Bayes networks. These correspond to aggregates of random variables described using graphical models, where individual random variables are linked in a graph structure with conditional distributions relating variables to nearby variables.
  • Random fields. These represent the extension of random processes to multiple dimensions, and are common in physics, where they are used in statistical mechanics to describe properties such as force or electric field that can vary continuously over three dimensions (or four dimensions, when time is included).
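As an illustration of a random sequence in which each variable depends only on its immediate predecessor, the following sketch samples from a two-state Markov chain. The state names and transition probabilities are hypothetical, and the helper `sample_chain` is written for this example only.

```python
import random

# Hypothetical transition probabilities: TRANSITIONS[current][next]
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def sample_chain(start: str, length: int, rng: random.Random) -> list:
    """Sample a sequence in which each state depends only on the previous one."""
    state = start
    sequence = [state]
    for _ in range(length - 1):
        options = TRANSITIONS[state]
        # Draw the next state using the conditional distribution of the current one
        state = rng.choices(list(options), weights=list(options.values()))[0]
        sequence.append(state)
    return sequence


# A seeded generator makes the run repeatable
sequence = sample_chain("sunny", 10, random.Random(0))
print(sequence)
```

Treating the same ten values as a generic random vector would require a full 10-dimensional joint distribution, whereas the chain needs only the small transition table, which is why such models scale to very long sequences.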

These concepts originate in various scientific fields and frequently overlap in usage. As a result, more than one of them can often be applied to the same problem.

References

  1. Mosteller, F.; Tukey, J.W. (1977). Data analysis and regression. Addison-Wesley. ISBN 978-0-201-04854-4.
  2. Nelder, J.A. (1990). "The knowledge needed to computerise the analysis and interpretation of statistical information". Expert systems and artificial intelligence: the need for information about data. London: Library Association. OCLC 27042489.
  3. Chrisman, Nicholas R. (1998). "Rethinking Levels of Measurement for Cartography". Cartography and Geographic Information Science. 25 (4): 231–242. doi:10.1559/152304098782383043.
  4. van den Berg, G. (1991). Choosing an analysis method. Leiden: DSWO Press. ISBN 978-90-6695-062-7.
  5. Hand, D.J. (2004). Measurement theory and practice: The world through quantification. Wiley. p. 82. ISBN 978-0-470-68567-9.
This page was last edited on 18 May 2024, at 10:43
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.