Chemometrics - Data Driven Extraction for Science

by Richard G. Brereton

Wiley, 2018

ISBN: 9781118904688, 464 pages

2nd edition


Chapter 1
Introduction


1.1 Historical Parentage


There are many opinions about the origin of chemometrics. Until quite recently, the birth of chemometrics was considered to have happened in the 1970s. Its name first appeared in 1972 in an article by Svante Wold [1]: in fact, the topic of this article was not one that we would recognise as being core to chemometrics, being relevant to neither multivariate analysis nor experimental design. For over a decade, the word chemometrics was considered to be of very low profile, and it developed a recognisable presence only in the 1980s, as described below.

However, if an explorer describes a new species in a forest, the species was there long before the explorer. Thus, the naming of the discipline just recognises that it had reached some level of visibility and maturity. As people re-evaluate the origins of chemometrics, the birth can be traced many years back.

Chemometrics burst into the world through the convergence of three fundamental factors: applied statistics (multivariate analysis and experimental design), statistics in analytical and physical chemistry, and scientific computing.

1.1.1 Applied Statistics


The ideas of multivariate statistics have been around for a long time. R.A. Fisher and colleagues, working at Rothamsted, UK, formalised many of our modern ideas while applying them primarily to agriculture. In the UK, before the First World War, many of the upper classes owned extensive land and relied for their income on tenant farmers and agricultural labourers. After the First World War, the cost of labour rose, with many workers moving to the cities, and there was stronger competition from imported food. Historic agricultural practices thus came to be seen as inefficient, and it was hard for landowners (or the companies that took over large estates) to remain economic and competitive; hence the huge emphasis on agricultural research, including statistics to improve farming practice. R.A. Fisher and co-workers published some of the first major books and papers that we would regard as defining modern statistical thinking [2, 3], introducing ideas ranging from the null hypothesis to discriminant analysis to ANOVA. Some of Fisher's work followed from the pioneering work of Karl Pearson at University College London, who had earlier founded the world's first statistics department and had first formulated ideas such as p values and correlation coefficients.

During the 1920s and 1930s, a number of important pioneers of multivariate statistics published their work, many strongly influenced by, or having worked with, Fisher. These included Harold Hotelling, credited by many with defining principal components analysis (PCA) [4], although Pearson had independently described the method some 30 years earlier, under a different guise. As so often in science, when ideas are reported several times over, it is the person who names and popularises a method who gets the credit: in the early twentieth century, libraries were often localised, there were very few international journals (Hotelling worked mainly in the US) and certainly no internet; therefore, parallel work was often reported.
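The core idea behind PCA as Hotelling and Pearson conceived it, finding orthogonal directions of maximum variance in multivariate data, can be sketched in a few lines. This is an illustrative sketch on invented toy data, using the singular value decomposition; it is not code from the book.

```python
import numpy as np

# Toy data: 20 samples measured on 3 variables (invented for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Mean-centre each column, the conventional first step of PCA
Xc = X - X.mean(axis=0)

# SVD of the centred matrix: the rows of Vt are the principal axes
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                  # projections of samples onto components
explained = s**2 / np.sum(s**2)     # fraction of variance per component

print(explained)                    # ordered by decreasing variance
```

The components come out ordered by the variance they explain, which is why in practice the first two or three often suffice to visualise a high-dimensional chemical data set.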

The principles of statistical experimental design were also formulated at around this period. There had been early reports of what we now regard as modern approaches to formal design before that, for example James Lind's work on scurvy in the eighteenth century and Charles Peirce's discussion of randomised trials in the nineteenth century, but Fisher's classic work of the 1930s put all the concepts together in a rigorous statistical format [5].

Much non-Bayesian applied statistical thinking has, for nearly a century, been based on principles established in the 1920s and 1930s. Early applications included agriculture, psychology, finance and genetics. After the Second World War, the chemical industry took an interest. In the 1920s, an important need was to improve agricultural practice, but by the 1950s a major need was to improve processes in manufacturing, especially chemical engineering; hence, many more statisticians were employed within industry. O.L. Davies edited an important book on experimental design with contributions from colleagues at ICI [6]. Foremost was G.E.P. Box, son-in-law of Fisher, whose book with colleagues is one of the most important post-war classics in experimental design and multi-linear regression [7].
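The simplest of the formal designs that this tradition produced is the two-level full factorial, in which every combination of high and low settings of each factor is run. A minimal sketch, with invented factor names, shows how small such a design is to enumerate:

```python
from itertools import product

# Hypothetical factors for a chemical process (names invented for illustration)
factors = ["temperature", "pH", "concentration"]
levels = [-1, +1]  # coded low/high settings, the usual factorial convention

# Every combination of levels: 2**3 = 8 experimental runs
design = list(product(levels, repeat=len(factors)))

for run, settings in enumerate(design, start=1):
    print(run, dict(zip(factors, settings)))
```

Unlike varying one factor at a time, a design like this lets the effect of every factor, and every interaction between factors, be estimated from the same eight runs.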

These statistical building blocks were already mature by the time people started calling themselves chemometricians and have changed only a little during the intervening period.

1.1.2 Statistics in Analytical and Physical Chemistry


Statistical methods, for example, to estimate accuracy and precision of measurements or to determine a best-fit linear relationship between two variables, have been available to analytical and physical chemists for over a century. Almost every general analytical textbook includes chapters on univariate statistics and has done for decades. Although theoretically we could view this as applied statistics, on the whole, the people who advanced statistics in analytical chemistry did not class themselves as applied statisticians and specialist terminology has developed over time.

Most quantitative analytical and physical chemistry was, until the 1970s, viewed as a univariate field; that is, only one variable was varied in an experiment, with all other external factors kept constant. This so-called 'One Factor at a Time' (OFAT) approach had worked well in mechanics and fundamental physics. Hence, statistical methods were primarily used for univariate analysis of data. By the late 1940s, some analytical chemists were aware of ANOVA, F-tests and linear regression [8], although the term chemometrics had not yet been invented; multivariate data came along much later.
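The univariate, OFAT style of analysis described above usually came down to a least-squares straight-line fit between one controlled variable and one measured response. A minimal sketch, with invented calibration data, illustrates the form such an analysis took:

```python
import numpy as np

# Invented data: a single controlled factor and a single measured response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. analyte concentration
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # e.g. instrument response

# Classical least-squares straight-line fit: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(round(slope, 2), round(intercept, 2))
```

Everything else about the experiment, temperature, reagent batch, operator, is assumed to be held constant, which is exactly the assumption that multivariate methods later relaxed.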

There would have been very limited cross-fertilisation between applied statisticians, working in mathematics departments, and analytical chemists in chemistry departments, during these early days. Different departments often had different buildings, different libraries and different textbooks. A chemist, however numerate, would feel a stranger walking into a maths building and would probably cocoon him- or herself in their own library. There was no such thing as the Internet, the Web of Knowledge or electronic journals. Maths journals published papers for mathematicians, and chemistry journals for chemists. Although in areas such as agriculture and psychology there was a tradition of consulting statisticians, chemists were numerate and tended to talk to each other: an experimental chemist wanting to fit a straight line would talk to a physical chemist in the tea room if need be. Hence, ideas did not travel within academia. Industry was somewhat more pragmatic, but even there, the main statistical innovations were in chemical engineering and process chemistry, often classed as industrial chemistry. The top universities often did not teach or research industrial chemistry, although they did teach Newtonian physics and relativity. In fact, the treatment of variables and errors by physicists trying, for example, to measure gravitational effects or the distance of a star is quite different from multivariate statistics: the former try to design experiments so that only one factor is studied and to ensure any errors are minimised and come from one source, whereas a multivariate statistician might accept, and indeed expect, data to be multifactorial.

Hence, statistics in analytical chemistry diverged from applied statistics for many decades. Caulcutt and Boddy's book, first published in 1983, contains nothing on multivariate statistics [9], and in Miller and Miller's book of 1993 just one of six main chapters is devoted to experimental design, optimisation and pattern recognition (including PCA) [10].

Even now, there are numerous useful books aimed at analytical and physical chemists that omit multivariate statistics. An elaborate vocabulary has developed for the needs of analytical chemists, with specialist concepts that are rarely encountered in other areas. Some analytical chemists in the 1960s to 1980s were aware that multivariate approaches existed and did venture into chemometrics, but good multivariate data were limited. Most are aware of ANOVA and experimental design. However, statistics for analytical chemistry tends to lead a separate existence from chemometrics, although multivariate methods derived from chemometrics do have a small foothold within most graduate-level courses and books in general analytical chemistry, and certainly quantitative analytical (and physical) chemistry was an important building block for modern chemometrics.

Over the last two decades, however, applications of chemometrics have moved far beyond traditional quantitative analytical chemistry, for example into metabolomics, the environment, cultural heritage and food, where the aim is not necessarily to measure accurately the concentration of an analyte or to determine how many compounds are present in the spectra of a series of mixtures. This means that the aim of some chemometric analysis has changed. We often do not have, for example, well-established reference samples, and in many cases we cannot judge a method by how efficiently it predicts the properties of such reference samples. We may not know whether the spectra of some extracts of urine samples contain enough information to tell whether our donors are diseased or not: it may depend on how the disease has progressed, how good the diagnosis is, what the genetics of the donor are, and so on. Hence, we may never have a model that perfectly distinguishes two groups of samples. In classical physical or...