
Topic and Talk:Regression toward the mean: Difference between pages

From Wikipedia, the free encyclopedia
{{WPStatistics}}
{{maths rating|
|field=probability and statistics
|importance=mid
|class=Start
|vital=
|historical=
}}




This topic (the article + this discussion) reads like a mad grad school breakdown. If you please, would someone who has a good grasp of Regression Toward the Mean write an explanation based on ONE COGENT EXAMPLE that reveals unambiguous data, processing steps, results. The audience is dying to know what regression means to them. What is needed is an actual dataset and walkthrough to illustrate the concept. You know, narrate Galton's height experiment, that would be wildly appropriate. Think of your readers as high schoolers stuck with a snotty textbook who want some mentoring on this subject AT THEIR LEVEL. They'll get a kick out of it if you can make it mean something to them, otherwise they'll drop out and live in shipping containers with Teener-Kibble for sustenance. This is, after all, a topic that only first year stats student should still be grappling with, yes? And of course it is Wikipedia.--[[User:24.113.89.98|24.113.89.98]] 05:24, 23 January 2007 (UTC)qwiki@edwordsmith.com


==Real Data==
I have added a real analysis of Francis Galton's data to better illustrate the law of regression.
--[[User:Puekai|Puekai]] ([[User talk:Puekai|talk]]) 08:30, 11 July 2008 (UTC)


I'm not sure this page explains "regression to the mean" very well.
:I agree; it's lousy. [[User:Michael Hardy|Michael Hardy]] 23:26, 2 Feb 2004 (UTC)
:The first time I read it, I thought it was lousy. The second time I read it, it was closer to mediocre.
F. Galton's use of the terms "reversion" and "regression" described a certain, specific biological phenomenon, and it is connected with the stability of an autoregressive process: if there is not regression to the mean, the variance of the process increases over time. There is no reason to think that the same or a similar phenomenon occurs in, say, scores of students, and appealing to a general "principle of regression to the mean" is unwarranted.
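The autoregressive point above is easy to check numerically. A minimal sketch (my own toy code, not from the discussion; `ar1_spread` and its parameters are invented for illustration): simulate many AR(1) paths <code>X(t+1) = phi*X(t) + noise</code> and compare the spread of the end points when the coefficient is below 1 (regression toward the mean) against exactly 1 (a random walk, no regression).

```python
import random

def ar1_spread(phi, steps=200, paths=2000, seed=0):
    """Simulate `paths` AR(1) runs X(t+1) = phi*X(t) + N(0,1) noise
    and return the sample variance of the final values."""
    rng = random.Random(seed)
    finals = []
    for _ in range(paths):
        x = 0.0
        for _ in range(steps):
            x = phi * x + rng.gauss(0.0, 1.0)
        finals.append(x)
    m = sum(finals) / len(finals)
    return sum((v - m) ** 2 for v in finals) / (len(finals) - 1)

# phi < 1: variance settles near the stationary value 1/(1 - phi^2)
var_stable = ar1_spread(0.7)
# phi = 1: no regression toward the mean; variance keeps growing with steps
var_unit = ar1_spread(1.0)
```

With phi = 0.7 the end-point variance stays near 1/(1 − 0.7²) ≈ 1.96; with phi = 1 it grows roughly linearly in the number of steps, which is exactly the "variance of the process increases over time" point.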
:I completely disagree with this one; there is indeed such a general principle. [[User:Michael Hardy|Michael Hardy]] 23:26, 2 Feb 2004 (UTC)

I guess I could be convinced of the existence of such a principle, but something more than anecdotes is needed to establish that.

:Absolutely. A rationale needs to be given. [[User:Michael Hardy|Michael Hardy]] 23:26, 2 Feb 2004 (UTC)

Regression to the mean is just like normality of natural populations: maybe it's there, maybe it isn't; the only way to tell is to study a lot of examples.

:No; it's not just empirical; there is a perfectly good rationale.

I'll revise this page in a week or two if I don't hear otherwise; the page should summarize Galton's findings,

:I don't think ''regression toward the mean'' should be taken to mean only what Galton wrote about; it's far more general. I'm really surprised that someone who's edited a lot of statistics articles here does not know that there is a reason why regression toward the mean in widespread, and what the reason is. I'll return to this article within a few days. [[User:Michael Hardy|Michael Hardy]] 23:26, 2 Feb 2004 (UTC)

connect the biological phenomenon with autoregressive stability, and mention other (substantiated) examples. [[User:Wile E. Heresiarch|Wile E. Heresiarch]] 15:00, 2 Feb 2004 (UTC)

----
In response to Michael Hardy's comments above --
# Perhaps I overstated the case. Yes, there is a class of distributions which show regression to the mean. (I'm not sure how big it is, but it includes the normal distribution, which counts for a lot!) However, if I'm not mistaken there are examples that don't, and these are by no means exotic.
# There is a terminology problem here -- it's not right to speak of a "principle of r.t.t.m." as the article does, since r.t.t.m. is a demonstrated property (i.e., a theorem) of certain distributions. "Principle" suggests that it is extra-mathematical, as in "likelihood principle". Maybe we can just drop "principle".
# I had just come over from the Galton page, & so that's why I had Galton impressed on my mind; this article should mention him but need not focus on his concept of regression, as pointed out above.
regards & happy editing, [[User:Wile E. Heresiarch|Wile E. Heresiarch]] 22:57, 3 Feb 2004 (UTC)

It's nothing to do with Normality - it applies to all distributions.
::::[[User:Johnbibby|Johnbibby]] 22:11, 12 December 2006 (UTC)

--

The opening sentence "of related measurements, the second is expected to be closer to the mean than the first" is obviously wrong.[[User:Jdannan|Jdannan]] 08:17, 15 December 2005 (UTC)



Small change to the historical background note.

== Principle of Regression ==

I agree that the "principle" cannot hold for all distributions, but only a certain class of them, which includes the normal distributions. I think R. A. Fisher found an extension to the case where the conditional distribution is Gaussian but the joint distribution need not be. In any case, in the section on "Mathematical Derivation", it should be made clear that the specific *linear* regression form E[Y|X]=rX is valid only when Y and X are jointly Gaussian. Of course there are some other examples such as when Y and X are jointly stable but that is another can of worms. The overall question might be rephrased: given two random variables X and Y of 0 mean and the same variance, for what distributions is |E[Y|X]| < |X| almost surely?
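The linear conditional-mean form for the jointly Gaussian case can be illustrated with a quick simulation (a sketch with invented parameters, not tied to any dataset discussed here): generate standardized (X, Y) with correlation r and check that the average of Y, given X near some value x, is close to r·x.

```python
import random

def conditional_mean_check(r=0.6, n=200000, seed=1):
    """For standardized jointly Gaussian X, Y with correlation r,
    E[Y | X = x] = r*x. Estimate E[Y | X near 1.0] empirically."""
    rng = random.Random(seed)
    ys_near_one = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Construct Y with correlation r to X.
        y = r * x + (1.0 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)
        if 0.9 < x < 1.1:
            ys_near_one.append(y)
    return sum(ys_near_one) / len(ys_near_one)

est = conditional_mean_check()  # theory predicts about r * 1.0 = 0.6
```

Since 0.6 < 1, the conditional mean of Y sits closer to 0 than the conditioning value of X, which is the regression effect in its cleanest form.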

I will make some small edits to the "mathematical derivation" section.

==Intelligence==

[[Linda Gottfredson]] points out that 40% of mothers having [[IQ]] of 75 or less also have children whose IQ is under 75 - as opposed to 7% of normal or bright mothers.

:Fortunately, because of regression to the mean, their children will tend to be brighter than they are, but 4 in 10 still have IQs below 75. ([http://www.udel.edu/educ/gottfredson/reprints/1997whygmatters.pdf Why g matters, page 40])

What do we know about IQ or ''[[g (factor)|g]]'' and regression toward the mean? [[User:Elabro|Elabro]] 18:55, 5 December 2005 (UTC)

:Your question seems to contain its own answer. Taking everything at face value, and brushing aside ''all'' the arguments (whether g exists, whether it means anything, whether Spearman's methodology was sound, whether imprecise measurements of g should be used to make decisions about people's lives, etc.) what the numbers you cite mean is simply that IQ measurements are ''mixtures'' of something that is inherited and something that is not inherited.

:Intelligence, as measured by IQ score, is just about 50% heritable.

:Regression doesn't have to do with the child, in this case, it has to do with the mother. The lower the mother's IQ measurement, the further away from the mean it is. The further away from the mean it is, ''the more likely'' that this was not the result of something inherited but of some other factor, one which won't be passed on to the child, who will therefore be expected to have higher intelligence than the mother.

:This isn't obvious at first glance but it is just plain statistics. Our article on regression doesn't have any diagrams, and one is needed here. [[User:Dpbsmith|Dpbsmith]] [[User_talk:dpbsmith|(talk)]] 20:26, 5 December 2005 (UTC)

::Thanks for explaining that. It's clear to me now, and I hope we can also make it clear to the reader.

::By the way, I'm studying "[[inheritance]]" and "[[heritage]]" and looking for factors (such as genes) that one cannot control, as well as factors (such as parenting techniques, choice of neighborhood and school) that one can control - and how these factors affect the [[academic achievement]] of children. This is because I'm interested in [[Educational reform]], a topic that Wikipedia has long neglected. [[User:Elabro|Elabro]] 22:10, 5 December 2005 (UTC)


I am having a difficult time believing the regression to mean effect in certain circumstances. For example, if 2 parents of equal IQ, say an IQ of 130, have children, if heritability is .7 that is the nature component of IQ right? The other .3 is often stated as being the mean of the population at large but that does not make sense to me. Wouldn't it be nurture, the food they consume and environmental stimulation they receive?

Does anyone have any statistics on high IQ (specific IQ values), well off parents who have children, and the children's IQ scores? I have not been able to find any and it is making it very difficult for me to believe that this effect is real if the parents specifically choose each other for their IQ.

Regression to mean as it is often used could imply that evolution into different species is not possible. I remember reading about insects in an underground cave that were recently discovered in Israel. There was no light in the cave. The insects in there had evolved no eyes. Regression to mean would imply no matter how much smaller or diminished a group of insects eyes were, their offspring would regress to the mean and have normal eyes yet over time the offspring evolved and evolved and ended up with no eyes. [[Special:Contributions/72.209.12.250|72.209.12.250]] ([[User talk:72.209.12.250|talk]]) 04:18, 13 June 2008 (UTC)

== Massachusetts test scores ==

HenryGB has twice removed a reference supporting the paragraph that gives MCAS "improvement" scores as a good example of the regression fallacy. He cites http://groups.google.com/group/sci.stat.edu/tree/browse_frm/thread/c1086922ef405246/60bb528144835a38?rnum=21&hl=en&_done=%2Fgroup%2Fsci.sta which I haven't had a chance to review. At the very least, it is extremely inappropriate to remove the reference supporting a statement without also removing the statement.

We need to decide whether this is a clear case of something that is ''not'' regression, in which case it doesn't belong in the article; or whether it's the usual case of a somewhat murky situation involving real-world data that isn't statistically pure, in a politically charged area, where different factions put a different spin on the data. If it's the latter, then it should go back with qualifying statements showing that not everyone agrees this is an actual example of regression. As I say, I haven't read his reference yet, so I don't know yet which I think. I gotta say that when I saw the headlines in the Globe about how shocked parents in wealthy towns were that their schools had scored much lower than some troubled urban schools on these "improvement" scores, the first thing that went through my mind was "regression." [[User:Dpbsmith|Dpbsmith]] [[User_talk:dpbsmith|(talk)]] 12:04, 31 March 2006 (UTC)

== Poorly written ==

The introduction is poorly written and fairly confusing.


== "SAT" ==

Would be better with an example that means something to those of us reading outside the USA. --[[User:Newshound|Newshound]] 16:08, 5 March 2007 (UTC)

== Sports info out of date ==

:''The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy. [[Melvin Mora]] of the [[Baltimore Orioles]] put up a season in [[2003]], at age 31, that was so far away from his performance in prior seasons that analysts assumed it had to be an outlier... but in [[2004]], Mora was even better. Mora, then, had truly established a new level of production, though he will likely regress to his more reasonable 2003 numbers in [[2005]].''

It's now 2007, but I don't know enough about baseball to comment on Mora's performance in 2005 or afterward. I also don't know how to tag this statement as out of date without using an "as of 2004" or "as of 2005" tag (I'm not sure how one could be worked in). Can anybody help? - [[User:Furrykef|furrykef]] ([[User_talk:Furrykef|Talk at me]]) 08:42, 4 April 2007 (UTC)

I have great difficulty understanding this article. Everything, including the math, is just a mess. It is quite remarkable that I have never heard of the phenomenon "regression to the mean"; it seems its usage is restricted to certain groups, such as the medical and social sciences.

My guess is that there are two phenomena: a) the biological property related to growth first observed in the 19th century, and b) an obvious matter. Let me explain b), the obvious matter. I have a die with possible outcomes {1, ..., 6}. Assume I threw a 6. Then the next time I throw that die, it is very likely that the outcome will be less than 6 (since there is no 7!). If one calls that 'regression to the mean', the expression is more complicated than the fact itself. Can anybody comment? [[User:Sabbah67|Sabbah67]] 13:54, 13 August 2007 (UTC)
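The dice version of point (b) is easy to verify directly (a toy sketch mirroring the comment above, not anyone's proposed article content): conditional on the current roll being a 6, the next independent roll still averages 3.5, so the follow-up to an extreme observation is, on average, less extreme.

```python
import random

def mean_after_six(trials=100000, seed=2):
    """Roll pairs of fair dice; among pairs whose first roll is 6,
    return the average of the second roll."""
    rng = random.Random(seed)
    total, count = 0, 0
    for _ in range(trials):
        first = rng.randint(1, 6)
        second = rng.randint(1, 6)
        if first == 6:
            total += second
            count += 1
    return total / count

avg = mean_after_six()  # near the unconditional mean 3.5, well below 6
```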


== "History" ==

I think the history section is quite good except I think the history of the regression line is a bit off topic. Only if more detail were included (such as a discussion of the implications of the fact that the regression line had a slope <1) would the typical reader see the relevance. My opinion is that the regression line discussion be deleted but I don't feel strongly enough about it to do so myself. <small>—Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/128.42.98.167|128.42.98.167]] ([[User talk:128.42.98.167|talk]]) 19:09, 25 September 2007 (UTC)</small><!-- Template:UnsignedIP --> <!--Autosigned by SineBot-->

== Defeating regression by establishing variance ==

Right, so i'm measuring quantity X over a population, and looking for an effect of applying treatment A.

If i measure X for all individuals, apply A to the lowest-scoring half, and measure again, i'll see an apparent increase because of RTM, right?

If i apply A to half the population at random, or to a stratified sample, can i expect to not see RTM?

Now, my real question, i guess, if i measure X ten times over the course of a year, then apply A to the lowest-scoring half, then measure X ten more times over the next year, then calculate the mean and variance / standard deviation / standard error of the mean for each individual, and look for improvements by t-testing, would i see an effect of RTM?

If i understand it right, RTM works because the value of X is some kind of underlying true value, plus an error term. If i pick the lowest values of X, i get not only individuals who genuinely have a low true X, but also those with a middling X who happened to have a negative error term when i measured X. Assuming the error term is random, doesn't that mean that taking multiple measurements and working out the envelope of variance allows me to defeat RTM?

-- Tom Anderson 2008-02-18 1207 +0000 <small>—Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/62.56.86.107|62.56.86.107]] ([[User talk:62.56.86.107|talk]]) 12:07, 18 February 2008 (UTC)</small><!-- Template:UnsignedIP --> <!--Autosigned by SineBot-->
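One way to probe Tom's question numerically (a hedged sketch with invented numbers: true scores and per-measurement noise both standard normal; `apparent_gain` is my own name, not anything from the article): select the lowest-scoring half by the mean of k measurements, re-measure with no treatment at all, and see how much spurious "improvement" appears.

```python
import random

def apparent_gain(n_measure, people=5000, seed=3):
    """Each person has a fixed true score; a measurement is truth + N(0,1) noise.
    Select the lowest-scoring half by the mean of `n_measure` measurements,
    then re-measure them with no treatment. Returns the apparent improvement,
    which is purely a regression-to-the-mean artifact."""
    rng = random.Random(seed)
    truths = [rng.gauss(0.0, 1.0) for _ in range(people)]

    def measure(t, k):
        return sum(t + rng.gauss(0.0, 1.0) for _ in range(k)) / k

    first = [(measure(t, n_measure), t) for t in truths]
    first.sort()
    low_half = first[: people // 2]
    before = sum(score for score, _ in low_half) / len(low_half)
    after = sum(measure(t, n_measure) for _, t in low_half) / len(low_half)
    return after - before  # positive = spurious "improvement"

gain_1 = apparent_gain(1)    # selection on a single noisy measurement
gain_10 = apparent_gain(10)  # selection on the mean of ten measurements
```

With a single measurement the untreated low half appears to improve by roughly half a standard deviation; averaging ten measurements before selecting shrinks the artifact markedly. That supports the envelope-of-variance intuition above, with the caveat that averaging reduces RTM rather than eliminating it.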

== POV template ==

This article is terrible.

"If you choose a subset of people who score above the mean, they will be (on average) above the mean on skill and above the mean on luck." -this is not cited at all. Additionally, it provides no information about the selection process used.

"a class of students takes a 100-item true/false test on a subject on which none of the students knows anything at all. '''Therefore''', all students choose randomly on all questions leading to a mean score of about 50." - therefore is obviously the wrong word here.

"Real situations fall between these two extremes: scores are a combination of skill and luck." -uncited

"It is important to realize" -obvious POV

"he couldn't possibly be expected to repeat it" -again

"The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy." -again

"the findings appear to be a case of regression to the mean." -uncited, pov

"Statistical analysts have long recognized the effect of regression to the mean in sports" - more pov

"Regression to the mean in sports performance produced the "Sports Illustrated Cover Jinx" superstition, in all probability." - you get the idea


etc etc

Last, but certainly not least, the appalling:

"Whatever you call it, though, regression to the mean is a fact of life, and also of sports." -make it stop make it stop


This thing needs a complete rewrite, from the ground up. [[Special:Contributions/219.73.78.161|219.73.78.161]] ([[User talk:219.73.78.161|talk]]) 15:28, 26 June 2008 (UTC)

: I am not sure the POV template is the right one however. Most of your complaints are more about sourcing (or lack thereof) and prose style, specifically that it sounds like how-to and/or an essay (which WP is [[Wikipedia:NOTHOWTO#HOWTO|not]]). That said, I agree the article has serious deficiencies. [[User:Baccyak4H|Baccyak4H]] ([[User talk:Baccyak4H|Yak!]]) 14:19, 16 July 2008 (UTC)
::It's difficult to source that which seems obvious, and which in general aren't statements of fact but rather mathematical tautologies. I think this article should be rewritten to explain things more clearly, but I can't agree that noting that for example, the distinction between 'progression and time' etc is 'obvious POV', any more than stating 1+1=2 is 'obvious POV', or indeed requires much citing.--[[User:Fangz|Fangz]] ([[User talk:Fangz|talk]]) 18:02, 21 July 2008 (UTC)

== Cleanup ==

I've done some cleanup on the commented-out section; here's the cleaned-up version. Probably it could use more work:

=== Francis Galton's experiment ===
The data are available from [http://www.medicine.mcgill.ca/epidemiology/hanley/galton/notebook/index.html] and [http://www.medicine.mcgill.ca/epidemiology/hanley/galton/galton_heights_197_families.txt]. They were post-processed by listing all 934 children, of whom 481 are male. Some children share the same parents, and therefore the same mid-parent height.
Galton assumed that people marry independently of height differences, i.e. [[random mating]] with respect to height.

Initial calculation suggests that the means of the two generations are the same: 69.219 inches for the mid-parents (Galton's own figure is <math>68\frac{1}{4}</math> inches; he seems to have made an arithmetic error), with 69.316 inches for the fathers and 64.002 inches for the mothers, and 69.175 inches for the offspring. Not only the [[mean]]s but also the variances agree: 2.647<sup>2</sup> for the fathers, 2.512<sup>2</sup> for the mothers (after scaling by a factor of 1.08), 2.623<sup>2</sup> for the sons and 2.5445<sup>2</sup> for the daughters. But the [[standard deviation]] of the mid-parent heights is only 1.802 inches, or

: <math>\frac{1.802^2}{2.607^2}=0.486\approx 0.5</math>

of the population variance. This is easily explained by the fact that

: <math>\text{midparent} = \frac{1}{2} (\text{fathers height} + 1.08 \times \text{mothers height}),</math>

therefore the [[variance]]s:

: <math>\sigma^2_\text{midparent}=\frac{1}{4}\sigma^2_\text{fathers height}+\frac{1.08^2}{4}\sigma^2_\text{mothers height}\approx\frac{1}{2}\sigma^2_\text{fathers height}.</math>

Further investigation shows that the [[correlation coefficient]] between mid-parent heights and offspring heights is 0.497, i.e. the relationship is only moderately [[linear]].

If we fit by the [[least squares]] method, using the following [[Matlab]] code:
<source lang="matlab">
% Least-squares slope and intercept of offspring height on mid-parent height:
% the pseudoinverse of the design matrix [midparent, 1] applied to the
% offspring vector gives the coefficients.
pinv([midparent(1:934) ones(934,1)])*offsprings(1:934)
</source>
and obtain

: <math>
\begin{align}
\text{offspring} & {} = 0.713\times \text{midparent} + 19.874 \\
& {} = 0.713\times \text{midparent} + 0.287 \times \text{population mean (inches)}.
\end{align}
</math>

This is illustrated as the blue line in Figure 1.

In fact, when the [[least squares method]] is used to [[estimate]] the [[slope]], the estimate equals

: <math>r\frac{S_Y}{S_X},</math>

or in this case,

: <math>0.497\times\frac{2.607}{1.807}=0.713\approx \frac{\sqrt 2}{2}\approx\frac{2}{3}.</math>
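The identity slope = r·S<sub>Y</sub>/S<sub>X</sub> can be checked on synthetic data (a sketch using simulated 'midparent'/'offspring' pairs with made-up, roughly Galton-like parameters, not the actual dataset above):

```python
import random

def fit_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(4)
r, sx, sy, mean = 0.5, 1.8, 2.6, 69.0  # invented, roughly Galton-like values
mid, off = [], []
for _ in range(50000):
    z = rng.gauss(0.0, 1.0)
    mid.append(mean + sx * z)
    # offspring correlated r with the midparent deviation
    off.append(mean + sy * (r * z + (1 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)))

slope = fit_slope(mid, off)  # near r * sy / sx = 0.5 * 2.6 / 1.8
```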

[[Image:Galton experiment.png|700px|thumbnail|'''Figure 1''' The distribution of the 934 adult children in [[Galton]]'s experiment against their corresponding [[midparent]] heights. The blue line is the [[least squares]] fit, while the brown one approximates the [[median]]s of the 11 categories of midparents. The semimajor axis is the principal component of the variable space. It is apparent in the figure that extreme parents do not, on average, produce equally extreme offspring, and that the extreme offspring (circled stars) are born to ordinary parents.]]

The ellipse indicates the [[covariance matrix]] of the offspring and mid-parent heights. It is given by

: <math>\left\{(x,y)|[x\; y]^T = {\rm cov}(\text{midparent},\text{offspring})\times [\cos t\; \sin t]^T,\quad t\in(0, 2\pi]\right\}.</math>

Numerically,

: <math>{\rm cov}(\text{midparent},\text{offspring})=\begin{bmatrix}3.248&2.317\\2.317&6.678\end{bmatrix}.</math>

It is also a [[contour]] of the 2-dimensional [[Gaussian distribution]], because

: <math> \frac{1}{(2\pi)^{\frac{N}{2}} | \Sigma |^{\frac{1}{2}}} e^{-\frac{1}{2} (\mathbf{x} - \mathbf{u_X})^T \Sigma^{-1} (\mathbf{x}-\mathbf{u_X})}=C</math>

: <math>\Rightarrow (\mathbf{x} - \mathbf{u_X})^T \Sigma^{-1} (\mathbf{x} - \mathbf{u_X}) = -2\ln\left(C(2\pi)^{\frac{N}{2}} |\Sigma|^{\frac{1}{2}}\right).</math>

Diagonalizing <math>\Sigma^{-1}</math> with the eigenvector matrix <math>U</math> of <math>\Sigma</math> and eigenvalues <math>\sigma^2_1, \sigma^2_2</math>,

: <math>\Rightarrow (\mathbf{x}-\mathbf{u_X})^T U \begin{bmatrix}
\frac{1}{\sigma^2_1} &0 \\
0 & \frac{1}{\sigma^2_2} \end{bmatrix} U^T(\mathbf{x}-\mathbf{u_X})=-2\ln\left(C(2\pi)^{\frac{N}{2}}|\Sigma|^{\frac{1}{2}}\right),</math>

so in the rotated coordinates <math>\bar{\mathbf{x}} = U^T(\mathbf{x}-\mathbf{u_X})</math>,

: <math>\Rightarrow \frac{\bar{x}_1^2}{\sigma^2_1}+\frac{\bar{x}_2^2}{\sigma^2_2}=\bar{C},\qquad \bar{C}=-2\ln\left(C(2\pi)^{\frac{N}{2}}|\Sigma|^{\frac{1}{2}}\right).</math>

So the ellipse drawn is the contour on which

: <math>\bar{C}=1.</math>

The slope of the semimajor axis reflects the ratio of the variances, and its length is the standard deviation of the first [[principal component]]. It is worth noting the [[counter-intuitive]] fact that the semimajor axis of the ellipse does not align with either of the fitted lines (least-squares or median-based).

Since 0.713 is substantially smaller than 1, we may conclude that the children of parents of extreme height are, on average, not as extreme. Galton did not use least squares. Instead, he grouped the mid-parent heights into 11 categories, namely 'Below', '64.5', '65.5', '66.5', '67.5', '68.5', '69.5', '70.5', '71.5', '72.5', 'Above'. For each category he found the median height of all its offspring, drew a line through the medians, and found its slope to be about 2/3, matching the slope predicted by the least squares method. He therefore concluded that offspring are not as extreme as their parents, which he termed the law of regression. But this conclusion is easily misunderstood: one might further conclude that the height variance decreases steadily over generations. In fact, the [[variance]] (2.5842<sup>2</sup>) of the 934 offspring is almost the same as that of the fathers and the scaled mothers, and about twice that of the mid-parents.

The reason the offspring heights 'regress toward the mean' without the variance shrinking is that Galton analyzed only the medians and ignored the fact that the spread of offspring heights differs across mid-parent categories: the variance of the offspring of, say, '72.5' is not the same as that of '71.5' or 'Above 72.5'. For mid-parents near the population mean, the offspring heights are much more dispersed than those of extreme mid-parents.

For example, the offspring of mid-parent category 'Above 72.5' concentrate on '72.2' and '73.2', whereas the offspring of category '72.5' span from '68.2' to 'Above 73.2'. As a result, the next generation as a whole still has the same variance as the fathers and mothers; but the extremely tall or short offspring most likely have mid-parents who are not as extreme as they are.
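The point that a slope below 1 need not shrink the population variance can be illustrated with a small simulation (my own sketch; the generational model and numbers are invented, chosen to echo the 0.713 slope above): each child's deviation is 0.713 times the parental deviation plus fresh noise sized so that total variance is preserved.

```python
import random

def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

def next_generation(heights, slope, rng):
    """Child deviation = slope * parent deviation + fresh noise, with the
    noise variance chosen so the population variance is preserved
    (the extra dispersion the paragraph above describes)."""
    n = len(heights)
    mean = sum(heights) / n
    var = variance(heights)
    noise_sd = (var * (1.0 - slope ** 2)) ** 0.5
    return [mean + slope * (h - mean) + rng.gauss(0.0, noise_sd) for h in heights]

rng = random.Random(5)
gen0 = [69.0 + rng.gauss(0.0, 2.6) for _ in range(20000)]
gen3 = gen0
for _ in range(3):  # three generations of "regression"
    gen3 = next_generation(gen3, 0.713, rng)

v0, v3 = variance(gen0), variance(gen3)  # nearly equal despite slope < 1
```

Individual families regress toward the mean every generation, yet the population's spread stays put, which is exactly why "variance decreases steadily over generations" is the wrong reading of Galton.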

== Commented out section ==

I probably oughtta explain myself a bit here. I realise that various editors have put a lot of work into Galton's example, which I commented out. But I mean, my main problem with that section is that I can't really see what it is adding to the article - to even realise it is an example of regression toward the mean, for example, would require some understanding of the Linear Model, and the relation with regression lines, and specifically how the slope of the line being under 1 is what is important here. The specifics of stuff like the principal component stuff just isn't very relevant to the article, as far as I can tell. Maybe a clarified image with explanation, plus some short comments on the source of the regression towards the mean, would be better. Similarly, with mathematical derivation, I'd like to replace it with something a bit more general, since for example the root 2 issue isn't very important in general.--[[User:Fangz|Fangz]] ([[User talk:Fangz|talk]]) 19:37, 21 July 2008 (UTC)

: Possibly it should be made a separate article to explain the history of the idea. [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 19:40, 21 July 2008 (UTC)

:: Actually, on second thoughts, the problem is a bit broader than the section - I myself am a bit confused about what the linear regression stuff is doing in e.g. the section on mathematical derivation. I suspect the main problem is that we are trying to overgeneralise in the 'ubiquity' section. It seems to be much simpler in the identically distributed but correlated regime, compared to whatever-the-heck 'ubiquity' is aiming at, which seems to try to relax assumptions but ends up assuming not merely normality but also a linear model without obvious explanation. Or maybe it's not simpler, argh. I get the distinct feeling we are overcomplicating something very obvious.--[[User:Fangz|Fangz]] ([[User talk:Fangz|talk]]) 21:54, 21 July 2008 (UTC)

Linear regression definitely belongs in this article. If linear regression is not mentioned, this article is definitely incomplete. But the article doesn't need to be complicated. [[User:Michael Hardy|Michael Hardy]] ([[User talk:Michael Hardy|talk]]) 04:36, 22 July 2008 (UTC)

==My Understanding==

If one were to test a group of students, they would find that the results fit on a bell curve. Likewise, if one were to test the same student 100 times (assuming they didn't improve), they would find that the results fit on a smaller bell curve. Because of this, the top 50% is going to have better luck, in general, than the bottom 50%. Everyone, on average, is going to have average luck next time, and that would be worse for the top 50% and better for the bottom 50%.

For example: If you look at the top 15 MLB teams (or 50%) on July 1st 2007 they had won 56.2% of their games. Over the rest of the season they won 51.3% of games. This is because most of the teams are average and just had flukes the first half. They still won the majority of the 2nd half games because there are a couple good teams.

In general most of a group are average, and the top half consist mostly of average people with a good day. They will do worse on average the next time because they have average luck usually.
[[Special:Contributions/72.42.134.253|72.42.134.253]] ([[User talk:72.42.134.253|talk]]) 02:11, 7 August 2008 (UTC)
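The half-season example can be reproduced in simulation (a sketch with invented parameters: 30 teams, true win probabilities near .500 with spread 0.04, 81 games per half; `half_season_regression` is my own name): the top half of the first-half standings wins at a noticeably lower rate in the second half even though nothing about the teams changes.

```python
import random

def half_season_regression(seed=6, seasons=500):
    """Simulate 30 teams, each with a fixed true win probability near .500,
    playing 81 games in each half-season. Return the average first- and
    second-half win rates of the teams in the top half of the
    first-half standings."""
    rng = random.Random(seed)
    tops_first, tops_second = [], []
    for _ in range(seasons):
        probs = [0.5 + rng.gauss(0.0, 0.04) for _ in range(30)]
        rec = []
        for p in probs:
            w1 = sum(rng.random() < p for _ in range(81)) / 81
            w2 = sum(rng.random() < p for _ in range(81)) / 81
            rec.append((w1, w2))
        rec.sort(reverse=True)          # rank by first-half record
        top = rec[:15]
        tops_first.append(sum(w1 for w1, _ in top) / 15)
        tops_second.append(sum(w2 for _, w2 in top) / 15)
    return sum(tops_first) / seasons, sum(tops_second) / seasons

first_half, second_half = half_season_regression()
```

Typical output is a first-half rate around .55 for the leaders falling to around .52 in the second half: the same pattern as the 56.2% → 51.3% figures quoted above, with the second-half rate still above .500 because some teams really are good.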

== I'd like to take a crack at a rewrite ==

But I'm pretty new to editing and I don't want to do it wrong. Here is my plan, please let me know if it offends you.

Summary, needs to be expanded a bit and made clear

Example, needs to be made clearer

History, it's important to mention Galton because he named the effect, and also because it explains the name of regression analysis. But this is of interest to history of statistics, not to regression to the mean. His is not a particularly good example, and his explanation of it uses obscure language and conflates biology and mathematics. The discussion of regression lines is irrelevant, the reader should be referred to the article on regression after being told this is the source of the name.

Ubiquity, I would rename and rewrite this section to distinguish among different effects that are sometimes referred to as regression toward the mean. There is a biological principle, and engineering principle and a mixing principle, in addition to the statistical principle under discussion. The idea is also related to shrinkage, which is important, and also important to distinguish.

Mathematical derivation, I don't want to insult the author, but the steps are trivial algebra that don't start at a natural place nor lead to any insight. Also the use of rho for the regression coefficient usually labeled beta is confusing (since rho is usually the correlation coefficient that the author labels r). The text is confusing. I don't think any math is necessary. I would think about putting in a short theoretical section.

Regression fallacies, I think this is excellent, and an important part of the article. I would shorten the "In Sports" and "In road safety policy" sections and combine them with this. I would number or bullet the list. <small><span class="autosigned">—Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:AaCBrown|AaCBrown]] ([[User talk:AaCBrown|talk]] • [[Special:Contributions/AaCBrown|contribs]]) 20:29, 13 October 2008 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->

Revision as of 20:30, 13 October 2008



This topic (the article + this discussion) reads like a mad grad school breakdown. If you please, would someone who has a good grasp of Regression Toward the Mean write an explanation based on ONE COGENT EXAMPLE that reveals unambiguous data, processing steps, results. The audience is dying to know what regression means to them. What is needed is an actual dataset and walkthrough to illustrate the concept. You know, narrate Galton's height experiment, that would be wildly appropriate. Think of your readers as high schoolers stuck with a snotty textbook who want some mentoring on this subject AT THEIR LEVEL. They'll get a kick out of it if you can make it mean something to them, otherwise they'll drop out and live in shipping containers with Teener-Kibble for sustenance. This is, after all, a topic that only first-year stats students should still be grappling with, yes? And of course it is Wikipedia.--24.113.89.98 05:24, 23 January 2007 (UTC)qwiki@edwordsmith.com


Real Data

I have added a real analysis of Francis Galton's data to better illustrate the law of regression. --Puekai (talk) 08:30, 11 July 2008 (UTC)

I'm not sure this page explains "regression to the mean" very well.

I agree; it's lousy. Michael Hardy 23:26, 2 Feb 2004 (UTC)
The first time I read it, I thought it was lousy. The second time I read it, it was closer to mediocre.

F. Galton's use of the terms "reversion" and "regression" described a certain, specific biological phenomenon, and it is connected with the stability of an autoregressive process: if there is not regression to the mean, the variance of the process increases over time. There is no reason to think that the same or a similar phenomenon occurs in, say, scores of students, and appealing to a general "principle of regression to the mean" is unwarranted.

I completely disagree with this one; there is indeed such a general principle. Michael Hardy 23:26, 2 Feb 2004 (UTC)

I guess I could be convinced of the existence of such a principle, but something more than anecdotes is needed to establish that.

Absolutely. A rationale needs to be given. Michael Hardy 23:26, 2 Feb 2004 (UTC)

Regression to the mean is just like normality of natural populations: maybe it's there, maybe it isn't; the only way to tell is to study a lot of examples.

No; it's not just empirical; there is a perfectly good rationale.

I'll revise this page in a week or two if I don't hear otherwise; the page should summarize Galton's findings,

I don't think regression toward the mean should be taken to mean only what Galton wrote about; it's far more general. I'm really surprised that someone who's edited a lot of statistics articles here does not know that there is a reason why regression toward the mean is widespread, and what the reason is. I'll return to this article within a few days. Michael Hardy 23:26, 2 Feb 2004 (UTC)

connect the biological phenomenon with autoregressive stability, and mention other (substantiated) examples. Wile E. Heresiarch 15:00, 2 Feb 2004 (UTC)


In response to Michael Hardy's comments above --

  1. Perhaps I overstated the case. Yes, there is a class of distributions which show regression to the mean. (I'm not sure how big it is, but it includes the normal distribution, which counts for a lot!) However, if I'm not mistaken there are examples that don't, and these are by no means exotic.
  2. There is a terminology problem here -- it's not right to speak of a "principle of r.t.t.m." as the article does, since r.t.t.m. is a demonstrated property (i.e., a theorem) of certain distributions. "Principle" suggests that it is extra-mathematical, as in "likelihood principle". Maybe we can just drop "principle".
  3. I had just come over from the Galton page, & so that's why I had Galton impressed on my mind; this article should mention him but need not focus on his concept of regression, as pointed out above.

regards & happy editing, Wile E. Heresiarch 22:57, 3 Feb 2004 (UTC)

It's nothing to do with Normality - it applies to all distributions.

Johnbibby 22:11, 12 December 2006 (UTC)

--

The opening sentence "of related measurements, the second is expected to be closer to the mean than the first" is obviously wrong.Jdannan 08:17, 15 December 2005 (UTC)


Small change to the historical background note.

Principle of Regression

I agree that the "principle" cannot hold for all distributions, but only a certain class of them, which includes the normal distributions. I think R. A. Fisher found an extension to the case where the conditional distribution is Gaussian but the joint distribution need not be. In any case, in the section on "Mathematical Derivation", it should be made clear that the specific *linear* regression form E[Y|X]=rX is valid only when Y and X are jointly Gaussian. Of course there are some other examples such as when Y and X are jointly stable but that is another can of worms. The overall question might be rephrased: given two random variables X and Y of 0 mean and the same variance, for what distributions is |E[Y|X]| < |X| almost surely?

I will make some small edits to the "mathematical derivation" section.
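The jointly Gaussian case discussed above can be checked numerically (a sketch of my own, not from this page; the construction of a correlated pair from two independent standard normals is the textbook one). With zero means, unit variances, and correlation r, samples selected for extreme X have a conditional mean of Y near r times the conditional mean of X, i.e. strictly closer to the mean:

```python
import random
import statistics

random.seed(0)
r = 0.6          # correlation between X and Y
n = 100_000

pairs = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = z1
    y = r * z1 + (1 - r * r) ** 0.5 * z2   # gives corr(X, Y) = r
    pairs.append((x, y))

# Condition on an extreme X: the conditional mean of Y is pulled toward 0.
tail = [(x, y) for x, y in pairs if x > 1.5]
mx = statistics.mean(x for x, y in tail)
my = statistics.mean(y for x, y in tail)
print(mx, my)   # my is roughly r * mx, i.e. strictly between 0 and mx
```

This is exactly the |E[Y|X]| &lt; |X| behavior asked about, at least for the bivariate normal case.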

Intelligence

Linda Gottfredson points out that 40% of mothers having IQ of 75 or less also have children whose IQ is under 75 - as opposed to 7% of normal or bright mothers.

Fortunately, because of regression to the mean, their children will tend to be brighter than they are, but 4 in 10 still have IQs below 75. (Why g matters, page 40)

What do we know about IQ or g and regression toward the mean? Elabro 18:55, 5 December 2005 (UTC)

Your question seems to contain its own answer. Taking everything at face value, and brushing aside all the arguments (whether g exists, whether it means anything, whether Spearman's methodology was sound, whether imprecise measurements of g should be used to make decisions about people's lives, etc.) what the numbers you cite mean is simply that IQ measurements are mixtures of something that is inherited and something that is not inherited.
Intelligence, as measured by IQ score, is just about 50% heritable.
Regression doesn't have to do with the child, in this case, it has to do with the mother. The lower the mother's IQ measurement, the further away from the mean it is. The further away from the mean it is, the more likely that this was not the result of something inherited but of some other factor, one which won't be passed on to the child, who will therefore be expected to have higher intelligence than the mother.
This isn't obvious at first glance but it is just plain statistics. Our article on regression doesn't have any diagrams, and one is needed here. Dpbsmith (talk) 20:26, 5 December 2005 (UTC)
Thanks for explaining that. It's clear to me now, and I hope we can also make it clear to the reader.
By the way, I'm studying "inheritance" and "heritage" and looking for factors (such as genes) that one cannot control, as well as factors (such as parenting techniques, choice of neighborhood and school) that one can control - and how these factors affect the academic achievement of children. This is because I'm interested in Educational reform, a topic that Wikipedia has long neglected. Elabro 22:10, 5 December 2005 (UTC)
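Dpbsmith's mixture explanation can be illustrated with a toy simulation (entirely my own sketch; the 50/50 heritable split, the 15-point SD, and the 75-point cutoff are just the figures quoted above, not a model of real IQ data): a mother's score mixes an inherited component and a non-inherited one, and her child keeps the inherited part but re-draws the rest.

```python
import random
import statistics

random.seed(1)
sd_part = 15 * 0.5 ** 0.5   # each component carries half of the 15^2 variance

mothers, children = [], []
for _ in range(200_000):
    g = random.gauss(0, sd_part)                   # heritable component
    mother = 100 + g + random.gauss(0, sd_part)    # plus non-inherited part
    child = 100 + g + random.gauss(0, sd_part)     # same g, fresh noise
    mothers.append(mother)
    children.append(child)

low = [(m, c) for m, c in zip(mothers, children) if m <= 75]
mean_mother = statistics.mean(m for m, c in low)
mean_child = statistics.mean(c for m, c in low)
frac_low_child = sum(c <= 75 for m, c in low) / len(low)
print(mean_mother, mean_child, frac_low_child)
```

The selected mothers' children average well above their mothers yet still below 100, and a substantial minority remain below 75, which is the qualitative pattern Gottfredson reports.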


I am having a difficult time believing the regression to mean effect in certain circumstances. For example, if 2 parents of equal IQ, say an IQ of 130, have children, if heritability is .7 that is the nature component of IQ right? The other .3 is often stated as being the mean of the population at large but that does not make sense to me. Wouldn't it be nurture, the food they consume and environmental stimulation they receive?

Does anyone have any statistics on high IQ (specific IQ values), well off parents who have children, and the children's IQ scores? I have not been able to find any and it is making it very difficult for me to believe that this effect is real if the parents specifically choose each other for their IQ.

Regression to mean as it is often used could imply that evolution into different species is not possible. I remember reading about insects in an underground cave that were recently discovered in Israel. There was no light in the cave. The insects in there had evolved no eyes. Regression to mean would imply no matter how much smaller or diminished a group of insects eyes were, their offspring would regress to the mean and have normal eyes yet over time the offspring evolved and evolved and ended up with no eyes. 72.209.12.250 (talk) 04:18, 13 June 2008 (UTC)

Massachusetts test scores

HenryGB has twice removed a reference supporting the paragraph that gives MCAS "improvement" scores as a good example of the regression fallacy. He cites http://groups.google.com/group/sci.stat.edu/tree/browse_frm/thread/c1086922ef405246/60bb528144835a38?rnum=21&hl=en&_done=%2Fgroup%2Fsci.sta which I haven't had a chance to review. At the very least, it is extremely inappropriate to remove the reference supporting a statement without also removing the statement.

We need to decide whether this is a clear case of something that is not regression, in which case it doesn't belong in the article; or whether it's the usual case of a somewhat murky situation involving real-world data that isn't statistically pure, in a politically charged area, where different factions put a different spin on the data. If it's the latter, then it should go back with qualifying statements showing that not everyone agrees this is an actual example of regression. As I say, I haven't read his reference yet, so I don't know yet which I think. I gotta say that when I saw the headlines in the Globe about how shocked parents in wealthy towns were that their schools had scored much lower than some troubled urban schools on these "improvement" scores, the first thing that went through my mind was "regression." Dpbsmith (talk) 12:04, 31 March 2006 (UTC)

Poorly written

The introduction is poorly written and fairly confusing.


"SAT"

Would be better with an example that means something to those of us reading outside the USA. --Newshound 16:08, 5 March 2007 (UTC)

Sports info out of date

The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy. Melvin Mora of the Baltimore Orioles put up a season in 2003, at age 31, that was so far away from his performance in prior seasons that analysts assumed it had to be an outlier... but in 2004, Mora was even better. Mora, then, had truly established a new level of production, though he will likely regress to his more reasonable 2003 numbers in 2005.

It's now 2007, but I don't know enough about baseball to comment on Mora's performance in 2005 or afterward. I also don't know how to tag this statement as out of date without using an "as of 2004" or "as of 2005" tag (I'm not sure how one could be worked in). Can anybody help? - furrykef (Talk at me) 08:42, 4 April 2007 (UTC)

I have great difficulty understanding this article. Everything, including the math, is just a mess. It is quite remarkable that I have never heard of the phenomenon "regression to the mean"; it seems that its usage is restricted to certain fields, such as medicine and the social sciences.

My guess is that there are two phenomena: a) the biological property related to growth first observed in the 19th century, and b) an obvious matter. Let me explain b), the obvious matter. I have a die with possible outcomes {1, ..., 6}. Assume I threw a 6. Then the next time I throw that die, it is very likely that the outcome will be less than 6 (since there is no 7!). If one calls that 'regression to the mean', the expression is more complicated than the fact itself. Can anybody comment?Sabbah67 13:54, 13 August 2007 (UTC)
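For what it's worth, the die in b) is the extreme case where successive measurements are completely uncorrelated, so the "regression" is total: whatever was just thrown, the expected next throw is the unconditional mean of 3.5. A trivial sketch (my own):

```python
import random
import statistics

random.seed(2)
throws = [random.randint(1, 6) for _ in range(100_000)]

# Average outcome immediately after a 6: back at the unconditional mean.
after_six = [throws[i + 1] for i in range(len(throws) - 1) if throws[i] == 6]
mean_after_six = statistics.mean(after_six)
print(mean_after_six)   # close to 3.5
```

Heights, test scores, and so on sit between this extreme and perfect correlation, which is why they regress only partway toward the mean rather than all the way.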


"History"

I think the history section is quite good except I think the history of the regression line is a bit off topic. Only if more detail were included (such as a discussion of the implications of the fact that the regression line had a slope <1) would the typical reader see the relevance. My opinion is that the regression line discussion be deleted but I don't feel strongly enough about it to do so myself. —Preceding unsigned comment added by 128.42.98.167 (talk) 19:09, 25 September 2007 (UTC)

Defeating regression by establishing variance

Right, so i'm measuring quantity X over a population, and looking for an effect of applying treatment A.

If i measure X for all individuals, apply A to the lowest-scoring half, and measure again, i'll see an apparent increase because of RTM, right?

If i apply A to half the population at random, or to a stratified sample, can i expect to not see RTM?

Now, my real question, i guess, if i measure X ten times over the course of a year, then apply A to the lowest-scoring half, then measure X ten more times over the next year, then calculate the mean and variance / standard deviation / standard error of the mean for each individual, and look for improvements by t-testing, would i see an effect of RTM?

If i understand it right, RTM works because the value of X is some kind of underlying true value, plus an error term. If i pick the lowest values of X, i get not only individuals who genuinely have a low true X, but also those with a middling X who happened to have a negative error term when i measured X. Assuming the error term is random, doesn't that mean that taking multiple measurements and working out the envelope of variance allows me to defeat RTM?

-- Tom Anderson 2008-02-18 1207 +0000 —Preceding unsigned comment added by 62.56.86.107 (talk) 12:07, 18 February 2008 (UTC)
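A simulation along the lines Tom Anderson describes (my own sketch; the "treatment" here has zero true effect, so any measured gain is pure RTM) suggests the answer is yes: averaging several measurements shrinks the error term and with it the spurious improvement, though it does not remove it entirely:

```python
import random
import statistics

random.seed(3)
N = 2000   # individuals; true X ~ N(50, 10), measurement error ~ N(0, 10)

def apparent_gain(k):
    """Select the lowest-scoring half by an average of k noisy pre-test
    measurements, re-measure with k more, and return the spurious gain
    (the treatment does nothing, so any gain is regression to the mean)."""
    true_x = [random.gauss(50, 10) for _ in range(N)]
    def measure(t):
        return statistics.mean(random.gauss(t, 10) for _ in range(k))
    pre = [measure(t) for t in true_x]
    cutoff = statistics.median(pre)
    chosen = [i for i in range(N) if pre[i] <= cutoff]
    post_mean = statistics.mean(measure(true_x[i]) for i in chosen)
    pre_mean = statistics.mean(pre[i] for i in chosen)
    return post_mean - pre_mean

g1 = apparent_gain(1)    # sizeable spurious "improvement"
g10 = apparent_gain(10)  # much smaller once the error term is averaged down
print(g1, g10)
```

Ten measurements cut the error variance by a factor of ten, so selection picks up far fewer "middling true value, unlucky error" individuals, which is exactly the mechanism described in the question.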

POV template

This article is terrible.

"If you choose a subset of people who score above the mean, they will be (on average) above the mean on skill and above the mean on luck." -this is not cited at all. additionally, its this provides no information about the process of choice used.

"a class of students takes a 100-item true/false test on a subject on which none of the students knows anything at all. Therefore, all students choose randomly on all questions leading to a mean score of about 50." - therefore is obviously the wrong word here.

"Real situations fall between these two extremes: scores are a combination of skill and luck." -uncited

"It is important to realize" -obvious POV

"he couldn't possibly be expected to repeat it" -again

"The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy." -again

"the findings appear to be a case of regression to the mean." -uncited, pov

"Statistical analysts have long recognized the effect of regression to the mean in sports" - more pov

"Regression to the mean in sports performance produced the "Sports Illustrated Cover Jinx" superstition, in all probability." - you get the idea


etc etc

Last, but certainly not least, the appalling:

"Whatever you call it, though, regression to the mean is a fact of life, and also of sports." -make it stop make it stop


This thing needs a complete rewrite, from the ground up. 219.73.78.161 (talk) 15:28, 26 June 2008 (UTC)

I am not sure the POV template is the right one however. Most of your complaints are more about sourcing (or lack thereof) and prose style, specifically that it sounds like how-to and/or an essay (which WP is not). That said, I agree the article has serious deficiencies. Baccyak4H (Yak!) 14:19, 16 July 2008 (UTC)
It's difficult to source that which seems obvious, and which in general aren't statements of fact but rather mathematical tautologies. I think this article should be rewritten to explain things more clearly, but I can't agree that noting that for example, the distinction between 'progression and time' etc is 'obvious POV', any more than stating 1+1=2 is 'obvious POV', or indeed requires much citing.--Fangz (talk) 18:02, 21 July 2008 (UTC)

Cleanup

I've done some cleanup on the commented-out section; here's the cleaned-up version. Probably it could use more work:

Francis Galton's experiment

The data are available from [1] and [2]. They were post-processed by listing all 934 children, of whom 481 are male. Some children share the same parents and therefore the same mid-parent height. Galton assumed that marriage is independent of height differences, i.e. random mating.

Initial calculation suggests that the means of the two generations are the same, i.e. 69.219 inches for the mid-parents (69.316 inches for fathers and 64.002 inches for mothers) and 69.175 inches for the offspring (Galton's own reported figure appears to be slightly miscalculated). Not only the means but also the variances agree: 2.647² for the fathers, 2.512² for the mothers (after scaling by a factor of 1.08), 2.623² for the sons and 2.5445² for the daughters. But the standard deviation of the mid-parents is only 1.802 inches, or about 1/√2 of that of the population. This is easily explained by the fact that

midparent = (father + 1.08 × mother) / 2,

therefore the variances satisfy

Var(midparent) = [Var(father) + Var(1.08 × mother)] / 4 ≈ σ²/2.

Further investigation suggests that the correlation coefficient between mid-parent heights and offspring heights is 0.497, indicating a moderate rather than strongly linear relationship.

If we use the least-squares method, we can estimate the fitted line with the following Matlab code (pinv computes the Moore-Penrose pseudoinverse, so this solves for the slope and intercept in one step):

pinv([midparent(1:934) ones(934,1)])*offsprings(1:934)

and obtain a slope of about 0.713 (and, from the means above, an intercept of roughly 19.8 inches). This is illustrated as the blue line in Figure 1.

In fact, the least-squares estimate of the slope equals

r × (σ_offspring / σ_midparent),

or in this case, 0.497 × (2.584 / 1.802) ≈ 0.713.

Figure 1 The distribution of the 934 adult children of Galton's experiment against their corresponding mid-parent heights. The blue line is the least-squares approximation, while the brown one approximates the medians of the 11 categories of mid-parents. The semimajor axis is the principal component of the variable space. It is apparent in the figure that, on average, the extreme parents do not give birth to equally extreme offspring, and that the extreme offspring (circled stars) are born to ordinary parents.

The ellipse indicates the covariance matrix of the offspring and mid-parent heights. It is given by

Σ = [ σ_x²       ρ σ_x σ_y ]
    [ ρ σ_x σ_y  σ_y²      ]

Numerically, with σ_x = 1.802, σ_y = 2.584 and ρ = 0.497,

Σ ≈ [ 3.25  2.31 ]
    [ 2.31  6.68 ]

It is also a contour of the 2-dimensional Gaussian distribution, because the density is proportional to exp(−½ zᵀ Σ⁻¹ z), where z is the deviation from the mean. So the ellipse is where

zᵀ Σ⁻¹ z = 1.

The slope of the semimajor axis indicates the ratio of the variances, and its length is the standard deviation of the principal component. It is worth noting the counter-intuitive phenomenon that the semimajor axis of the ellipse does not align with the fitted lines, whether least-squares or median-fitted.

Since 0.713 is much smaller than 1.0, we may conclude that the children of parents of extreme heights will not be as extreme. Galton did not use least squares to approximate; instead he quantized the mid-parent heights into 11 categories, namely 'Below', '64.5', '65.5', '66.5', '67.5', '68.5', '69.5', '70.5', '71.5', '72.5', 'Above'. For each category he found the median height of all the offspring in that category. He drew a line through the medians and found that its slope is about 2/3, closely matching the slope predicted by the least-squares method. He therefore concluded that the offspring are not as extreme as their parents, which he termed the law of regression. But this conclusion is easily misunderstood, as one may further conclude from it that the height variance will decrease steadily over generations. In fact, the variance (2.5842²) of the 934 offspring is almost the same as that of the fathers and the scaled mothers, and about 2 times that of the mid-parents.

The reason that offspring heights 'regress toward the mean' is that Galton analyzed only the medians and ignored the fact that, for each category of mid-parent height, e.g. '72.5', the variance of its offspring differs from that of '71.5' or 'Above 72.5'. For mid-parents near the population mean, the offspring heights are much more dispersed than those of the offspring of extreme mid-parents.

For example, for the mid-parent category 'Above 72.5', the offspring concentrate at '72.2' and '73.2', whereas for the '72.5' category the offspring span from '68.2' to 'Above 73.2'. As a result, the next generation as a whole still has the same variance as the fathers and mothers. But for the extremely tall or short offspring, it is more likely that their mid-parents were not as extreme as they are.
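Since the data files themselves are not reproduced here, the computation can at least be sketched on simulated Galton-like data (my own stand-in numbers, taken from the figures above: population SD ≈ 2.6 in, mid-parent SD smaller by 1/√2, correlation ≈ 0.5). It reproduces the two key facts: a fitted slope well below 1, yet no shrinkage of the offspring variance:

```python
import random
import statistics

random.seed(4)
n, mu, sigma, r = 934, 69.2, 2.6, 0.5

midparent, offspring = [], []
for _ in range(n):
    x = random.gauss(mu, sigma / 2 ** 0.5)   # mid-parent SD is sigma/sqrt(2)
    # regression slope = r * (sigma_off / sigma_mid) = r * sqrt(2)
    y = mu + r * 2 ** 0.5 * (x - mu) + random.gauss(0, sigma * (1 - r * r) ** 0.5)
    midparent.append(x)
    offspring.append(y)

# least-squares slope = cov(x, y) / var(x)
mx, my = statistics.mean(midparent), statistics.mean(offspring)
cov = sum((a - mx) * (b - my) for a, b in zip(midparent, offspring)) / (n - 1)
slope = cov / statistics.variance(midparent)
sd_off = statistics.stdev(offspring)
print(slope, sd_off)   # slope well below 1; offspring SD still ≈ sigma
```

The slope below 1 is the regression effect; the unchanged offspring SD is the point made above that the population variance does not shrink over generations.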

Commented out section

I probably oughtta explain myself a bit here. I realise that various editors have put a lot of work into Galton's example, which I commented out. But I mean, my main problem with that section is that I can't really see what it is adding to the article - to even realise it is an example of regression toward the mean, for example, would require some understanding of the Linear Model, and the relation with regression lines, and specifically how the slope of the line being under 1 is what is important here. The specifics of stuff like the principal component stuff just isn't very relevant to the article, as far as I can tell. Maybe a clarified image with explanation, plus some short comments on the source of the regression towards the mean, would be better. Similarly, with mathematical derivation, I'd like to replace it with something a bit more general, since for example the root 2 issue isn't very important in general.--Fangz (talk) 19:37, 21 July 2008 (UTC)

Possibly it should be made a separate article to explain the history of the idea. Michael Hardy (talk) 19:40, 21 July 2008 (UTC)
Actually, on second thoughts, the problem is a bit broader than the section - I myself am a bit confused about what the linear regression stuff is doing in e.g. the section on mathematical derivation. I suspect the main problem is that we are trying to overgeneralise in the 'ubiquity' section. It seems to be much simpler in the identically distributed but correlated regime, compared to whatever-the-heck 'ubiquity' is aiming at, which seems to try to relax assumptions but ends up assuming not merely normality but also a linear model without obvious explanation. Or maybe it's not simpler, argh. I get the distinct feeling we are overcomplicating something very obvious.--Fangz (talk) 21:54, 21 July 2008 (UTC)

Linear regression definitely belongs in this article. If linear regression is not mentioned, this article is definitely incomplete. But the article doesn't need to be complicated. Michael Hardy (talk) 04:36, 22 July 2008 (UTC)

My Understanding

If one were to test a group of students, they would find that the results fit on a bell curve. Likewise, if one were to test the same student 100 times (assuming they didn't improve), they would find that their results fit on a smaller bell curve. Because of this, the top 50% is going to have had better luck in general than the bottom 50%. Everyone on average is going to have average luck next time, and that would be worse for the top 50% and better for the lower 50%.

For example: If you look at the top 15 MLB teams (or 50%) on July 1st 2007 they had won 56.2% of their games. Over the rest of the season they won 51.3% of games. This is because most of the teams are average and just had flukes the first half. They still won the majority of the 2nd half games because there are a couple good teams.

In general most of a group are average, and the top half consist mostly of average people with a good day. They will do worse on average the next time because they have average luck usually. 72.42.134.253 (talk) 02:11, 7 August 2008 (UTC)
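That skill-plus-luck account is easy to simulate (my own toy league, not the actual 2007 standings: 30 teams, true skill around .500, two independent 81-game halves). Averaged over many simulated seasons, the top half of the first-half table wins noticeably less in the second half, while staying above .500 because some of its edge is real skill:

```python
import random
import statistics

random.seed(5)

def half_season(skill, games=81):
    # win fraction over one half-season of independent games
    return sum(random.random() < skill for _ in range(games)) / games

def season():
    skills = [random.gauss(0.500, 0.040) for _ in range(30)]
    first = [half_season(s) for s in skills]
    second = [half_season(s) for s in skills]
    top = sorted(range(30), key=lambda i: first[i], reverse=True)[:15]
    return (statistics.mean(first[i] for i in top),
            statistics.mean(second[i] for i in top))

results = [season() for _ in range(300)]
first_top = statistics.mean(f for f, s in results)
second_top = statistics.mean(s for f, s in results)
print(first_top, second_top)   # second-half figure falls back toward .500
```

The pattern matches the MLB numbers quoted above qualitatively: a gaudy first-half percentage, then a second half that is better than .500 but much closer to it.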

I'd like to take a crack at a rewrite

But I'm pretty new to editing and I don't want to do it wrong. Here is my plan, please let me know if it offends you.

Summary, needs to be expanded a bit and made clear

Example, needs to be made clearer

History, it's important to mention Galton because he named the effect, and also because it explains the name of regression analysis. But this is of interest to history of statistics, not to regression to the mean. His is not a particularly good example, and his explanation of it uses obscure language and conflates biology and mathematics. The discussion of regression lines is irrelevant, the reader should be referred to the article on regression after being told this is the source of the name.

Ubiquity, I would rename and rewrite this section to distinguish among different effects that are sometimes referred to as regression toward the mean. There is a biological principle, an engineering principle, and a mixing principle, in addition to the statistical principle under discussion. The idea is also related to shrinkage, which is important, and also important to distinguish.

Mathematical derivation, I don't want to insult the author, but the steps are trivial algebra that don't start at a natural place nor lead to any insight. Also the use of rho for the regression coefficient usually labeled beta is confusing (since rho is usually the correlation coefficient that the author labels r). The text is confusing. I don't think any math is necessary. I would think about putting in a short theoretical section.

Regression fallacies, I think this is excellent, and an important part of the article. I would shorten the "In Sports" and "In road safety policy" sections and combine them with this. I would number or bullet the list. —Preceding unsigned comment added by AaCBrown (talkcontribs) 20:29, 13 October 2008 (UTC)