About

Welcome to Intelligent-Machines.info. This blog is dedicated to artificial intelligence and machine learning, and focuses on applications in business, science and everyday life. It is maintained by Damien François. Enjoy your visit!

Links

Machine Learning Thoughts
Predict This!
Lets learn, shall we ?
Togelius
Artificial Intelligence 2.0
Data Mining in Matlab
The Aspiring* Dr. Coppersmith
Genetic Argonaut
MEDAL blogging
Michael Orlov's blog
PHP/Math Project
Michael Williams' blog
Machines Like Us
Know Thyself
Al Fin's blog
Developing Intelligence
One R Tip A Day
Random Activity Posting
Data mining research
Vilot
Life analytics
Positively 4th street
A beautiful WWW
About Intelligence
Onionesque Reality
Random Ramblings..
IEEE portal about AI
Apache Lucene Machine Learning page


Data normalization for statistical analysis

Friday 29 January 2010 at 2:04 pm In data mining and statistical data analysis, data need to be prepared before models can be built or algorithms can be used. In this context, preparing the data means transforming them prior to the analysis so as to ease the algorithm's job. Often, the rationale will be to alter the data so that the hypotheses on which the algorithms are based are verified, while at the same time keeping their information content intact. One of the most basic transformations is normalisation.

What is normalisation

The term normalisation is used in many contexts, with distinct, but related, meanings. Basically, normalising means transforming so as to render normal. When data are seen as vectors, normalising means transforming the vector so that it has unit norm. When data are thought of as random variables, normalising means transforming to a normal distribution. When the data are hypothesized to be normal, normalising means transforming to unit variance.

Let us consider data as a table where each row corresponds to an observation (a data element), and each column corresponds to a variable (an attribute of the data). Let us furthermore assume that each data element has a response value (target) associated with it (i.e. we focus on supervised learning).

Variable (column) normalisation

Why column normalisation? The simple answer is so that variables can be compared fairly in terms of information content with respect to the target variable. This issue is most important for algorithms and models that are based on some sort of distance, such as the Euclidean distance. As the Euclidean distance is computed from a sum of squared differences between variable values, its result greatly depends on the ranges of the variables. Should a variable have a dynamic range (or variance), say, 100 times larger than the others, then its value will mostly dictate the value of the distance, practically ignoring the values of the other variables. Should those other variables be of some importance, the distance would be practically useless in any algorithm.
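
To make the argument concrete, here is a minimal sketch in plain Python. The three observations (a height in metres and an income in euros) are made-up values chosen for illustration only.

```python
import math


def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical observations: (height in metres, income in euros).
p = (1.60, 30000.0)
q = (1.95, 30100.0)  # very different height, almost the same income
r = (1.60, 31000.0)  # same height, slightly different income

d_pq = euclidean(p, q)
d_pr = euclidean(p, r)

# The income column has by far the largest range, so it dictates the
# distance: q looks "closer" to p than r does, although q differs a lot
# from p on the height variable.
print(d_pq, d_pr)
```

Without normalisation, the height difference contributes almost nothing to either distance.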

To avoid the latter situation, variables (columns) are normalised to the same 'dynamic range', with no units (they become dimensionless values). In practice, the way the normalisation is handled depends on the hypotheses made (or, as a matter of fact, on the personal experience of the practitioner).

Alternative 1: variables are assumed to be normally distributed with distinct means and variances.

In such a case, the idea is to centre all variables so that they have zero mean, and to divide them by their standard deviation so that they all have unit variance. The transformed variables are then what are called 'z-scores' in the statistical literature. They are expressed in 'number of standard deviations' from the mean of the original data. If the normality assumption holds, about 68% of the transformed values lie within the [-1, 1] interval, and about 95% within [-2, 2].
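
As an illustration, a z-score transformation of one column could be sketched as follows. The function name z_scores and the sample heights are assumptions made for this example, and the population standard deviation is used (the sample version would work just as well).

```python
import statistics


def z_scores(values):
    """Centre a column on zero mean and scale it to unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("a constant column cannot be scaled to unit variance")
    return [(v - mean) / std for v in values]


# Hypothetical column of heights in centimetres.
heights = [150, 160, 170, 180, 190]
z = z_scores(heights)
print(z)
```

The transformed column has zero mean and unit variance, whatever the unit of the original one.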

Alternative 2: variables are assumed to be uniformly distributed with distinct ranges.

Then, the idea is to level all variables to the same minimum (e.g. zero) and maximum (e.g. one). The transformed values then of course lie in the interval [0, 1] and are expressed as fractions of the original range.
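
A minimal sketch of this rescaling, assuming a helper named min_max (a name invented here) and a made-up column of values:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale a column linearly so that it spans [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant column has no range to rescale")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical column with an arbitrary range.
scaled = min_max([10, 20, 40])
print(scaled)
```

Note that the minimum and maximum are very sensitive to outliers: a single extreme value squeezes all the others into a small part of the interval.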

Alternative 3: no hypothesis is assumed.

When no hypothesis is made, the solution is to replace each original value with its percentile in the distribution of the original variable. The data are then squashed, in a non-linear way, between zero and one, based on the empirical cumulative distribution function of each variable.
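
One possible way to compute such percentiles is sketched below. It uses the convention 'fraction of values lower than or equal to the current one', which is only one of several common definitions of a percentile rank; the function name and the data are illustrative.

```python
from bisect import bisect_right


def percentile_ranks(values):
    """Replace each value by the fraction of values that are <= it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


# Hypothetical column spanning several orders of magnitude.
ranks = percentile_ranks([1, 1000, 10, 100])
print(ranks)
```

Only the order of the values matters here, so the result is the same for any monotonically increasing transformation of the original variable.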

Element (row) normalisation

Why row normalisation? While column normalisation can be applied to any data table, row normalisation makes sense only when all variables are expressed in the same unit. This is often the case, for instance, with functional data, that is, data that come from the discretisation of some function. Row normalisation makes sense in such a context when the measurements are prone to bias, or when the information lies in relative measurements rather than in absolute ones. Then, the same normalisation procedures can be applied as for column normalisation. Often, the mean and the variance, or the maximum and the minimum, of each data element are added as extra variables prior to the analysis.
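
The following sketch applies the z-score variant to a single row and appends the row's mean and variance as two extra variables. The function name and the two 'curves' are made up for the example.

```python
import statistics


def normalise_row(row):
    """Z-score one data element and keep its mean and variance as extras."""
    mean = statistics.mean(row)
    var = statistics.pvariance(row)  # population variance
    if var == 0:
        raise ValueError("a constant row cannot be scaled to unit variance")
    std = var ** 0.5
    return [(v - mean) / std for v in row] + [mean, var]


# Two hypothetical curves with the same shape; the second one is the
# first one multiplied by 2 and shifted by 10 (e.g. a measurement bias).
curve_a = normalise_row([1, 2, 3])
curve_b = normalise_row([12, 14, 16])
print(curve_a)
print(curve_b)
```

After normalisation the two shapes coincide, and the offset and scale survive only in the two extra variables.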

Target normalisation


Why target normalisation? Because building a model between the data elements and their associated target is made easier when the set of values to predict is rather compact. So when the distribution of the target variable is skewed, that is, when there are many lower values and a few higher values (e.g. the distribution of income: the income is non-negative, most people earn around the average, and few people make bigger money), it is preferable to bring the variable closer to a normal one by computing its logarithm. Then the distribution becomes more symmetric.
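
A small sketch of the idea, with invented income figures. The logarithm is only defined for strictly positive values, so the helper below rejects anything else; targets that contain zeros would need a shifted variant such as log(1 + x), which is an alternative not discussed in this post. The skewness helper is only there to quantify the effect.

```python
import math
import statistics


def log_transform(targets):
    """Natural logarithm of each target value (strictly positive only)."""
    if any(t <= 0 for t in targets):
        raise ValueError("the logarithm requires strictly positive targets")
    return [math.log(t) for t in targets]


def skewness(values):
    """Mean of the cubed z-scores; zero for a symmetric distribution."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return statistics.mean(((v - mean) / std) ** 3 for v in values)


# Hypothetical yearly incomes: many moderate values, one very large one.
incomes = [18000, 22000, 25000, 30000, 40000, 250000]
log_incomes = log_transform(incomes)

print(skewness(incomes), skewness(log_incomes))

# Predictions made on the log scale are mapped back with the exponential.
recovered = [math.exp(v) for v in log_incomes]
```

Keep in mind that a model trained on the transformed target predicts logarithms, so its outputs have to be mapped back before being interpreted.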

Summary

Normalisation is a procedure followed to bring the data closer to the requirements of the algorithms, or at least to pre-process data so as to ease the algorithm's job. Variables can be normalised (to zero mean and unit variance, or to the interval [0, 1]), data elements can be normalised (when all their attributes have the same 'units') and the target variable can be normalised too (using a logarithmic transform). The choice to normalise or not is of course left to the practitioner, but it can be advised, with virtually no risk, to always normalise variables to [0, 1] when their values are bounded, to zero mean and unit variance otherwise, and to log-transform the target whenever it is skewed.

A supervised model performance cheatsheet

Thursday 29 October 2009 at 08:59 am Whenever a prediction model is built, its performance must be estimated so as to get an idea of how accurate the model is. The fact is that many different measures have been proposed and used inconsistently, sometimes making it difficult to compare models. I have put together a list of the most common ones, along with their definitions and equations, to serve as a handy reminder, in the spirit of cheatsheets.

You can download it from here. Do not hesitate to email me any comment you might have about it.


P.S. The link was working from the front page only; I have just corrected that. (Thanks Kevin and Alex.)

Follow me on Twitter

Thursday 24 September 2009 at 3:30 pm I regularly tweet about machine learning applications I find on the web and other AI-related web pages.

http://twitter.com/damienfrancois

What artificial intelligence can achieve

Monday 26 January 2009 at 1:13 pm The time when artificial intelligence will make robots more intelligent than we are has not arrived yet. Artificial intelligence is however more than a dream for visionary scientists; it is a very active and broad research field from which many useful tools for solving problems have arisen.
The applications where artificial intelligence can help are broadly divided into three categories, which are detailed hereafter... (more)

Machine learning videos

Monday 26 January 2009 at 1:10 pm For people interested in learning about machine learning, here is a list of websites where you can find video lectures on topics related to machine learning.

The main references are www.videolectures.net, which has a section dedicated to videos from the PASCAL network, and AAAI.org.

On delicious, pskomoroch has compiled a huge list of videos in various domains, including machine learning. And of course, there's always Google video.

Some blogs also link to videos. Free Science and video lectures online! has compiled a list of video lectures given at Machine Learning Summer School 2003, 2005 and 2006. Most videos come from www.videolecture.com. Data Wrangling proposes a large list of Hidden Video Courses in Math, Science, and Engineering, and Business Intelligence, Data Mining & Machine Learning a list of Machine Learning OnLine Lectures. Olivier Bousquet also has a page dedicated to Machine Learning Videos. On Cgkt’s Weblog, you can find Video Lectures on Probabilistic Graphical Models, as well as on LectureFox; free university lectures » mathematics.

Finally, note that Berkeley and MIT also publish online videos of lectures.

Feel free to comment and/or add more sources!

For less advanced lectures, here is a link to a video by Tom Mitchell, author of one of the founding reference books in machine learning. He is the head of the Machine Learning Department at Carnegie Mellon University. The video is aimed at people who do not know the field of machine learning and contains very little technical content. Interested readers may also consider reading his white paper introducing Machine Learning.

Linkdump

> Stanford's robotic Audi to brave Pikes Peak without a driver (w/ Video)

> Microsoft's Hands-Free Answer to the Nintendo Wii

> Evaluating with Unbalanced Training Data

> Recommendation system that identifies a valuable user action by mining data ... Microsoft patent

> Take a Second Shot at Understanding Math
