Good Stats Make Us Uncomfortable (IO Psychology)

Topic: Organizational Performance, Statistics, Strategic HR
Publication: Harvard Business Review (OCT 2012)
Article: The True Measures of Success
Authors: M. J. Mauboussin
Reviewed By: Megan Leasher

In striving for profitability, companies often rely on key indicators of organizational performance.  Common indicators like sales growth, customer loyalty, and earnings per share often guide strategy decisions and resource allocation.  But sometimes key indicators may not be that “key” after all.  They may have little or no true connection to profitability.

Organizations might not be aware of this and continue to rely on these same measures because they feel as though they matter.  Their intuition overrides everything else and as a result they don’t do the due diligence to determine what actually leads to profit.  They become overconfident, grab onto any numbers that are easily available, and rely on things they have always looked at in the past.  They choose what they like and what feels comfortable.  But they don’t actually analyze.  So how can we know if something truly predicts value?  We cannot leverage something we don’t know.

This article focuses on identifying indicators that serve as true statistical predictors of value.  The author emphasizes that for an indicator to be truly connected to value it must be both predictive and persistent.  Indicators that are predictive demonstrate a statistical link to value; a link strong enough that we feel confident saying there is a connection that has meaning and is not due to chance.  Indicators that are persistent stand the test of time; they reliably show that an outcome is controlled by applying skill or knowledge, and is not random.

The author advocates several steps in selecting the best indicators of organizational performance.  These steps include defining a clear business objective, developing theories to determine what measures might link to the objective, and statistically testing the relationship between the measures and the objective.
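The final step, statistically testing the link between a candidate measure and the objective, can start as simply as computing a correlation. A minimal Python sketch, using made-up quarterly numbers (the variables and data are illustrative, not from the article):

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between a candidate indicator and a business objective."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: customer-loyalty scores vs. profit (in $M) over six quarters.
loyalty = [7.1, 7.4, 6.9, 7.8, 8.0, 7.6]
profit = [1.2, 1.3, 1.1, 1.5, 1.8, 1.4]
print(round(pearson_r(loyalty, profit), 2))
```

A high correlation here is only the starting point; the author's "persistent" criterion still requires showing the link holds up over time rather than in one lucky sample.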

As I read these steps they made complete sense to me, but my data-happy left brain went nuts, thinking about other questions that should be considered.  Like: “What else do we already measure?…Could it matter?” and “What else can we measure?…What else should we measure?” and “What other viewpoints are we not taking into account?” and “What curveballs could come our way?”

Sometimes a meaningful statistic can push us out of our comfort zone.  Actually, sometimes a meaningful statistic should push us out of our comfort zone.  It might not make automatic, inherent sense to us, especially at first.  If all statistics made complete, gut-happy sense to us we wouldn’t need them.  We could always rely on our intuition because it would always be correct.  But statistics are useful because they not only tell us how meaningful things might be related; they can surprise us with the sheer fact of what things might be related.

If a predictor of success isn’t pointing in the direction of success, it’s not a predictor.  It’s simply a number.  And a useless one at that.

Mauboussin, M. J. (2012, October).  The true measures of success.  Harvard Business Review, 46-56.

human resource management, organizational industrial psychology, organizational management

Size Matters in Court? Determinations of Adverse Impact Based on Organization Size (IO Psychology)

Topic: Assessment, Discrimination, HR Policy, Statistics
Publication: Journal of Business and Psychology (JUN 2012)
Article: Unintended consequences of EEO enforcement policies: Being big is worse than being bad
Authors: R. Jacobs, K. Murphy, and J. Silva
Reviewed By: Megan Leasher

Adverse impact occurs when neutral-appearing employment practices have an unintentional, discriminatory effect on a protected group. The Equal Employment Opportunity Commission is charged with enforcing all federal legislation related to employment discrimination and adheres to the 1978 Uniform Guidelines on Employee Selection Procedures for “rules of thumb” on inferring whether adverse impact is present.

However, it’s tricky for a plaintiff to present conclusive evidence that adverse impact is present in an organization’s practices.  Statistical evidence is needed to demonstrate whether employment practices are truly discriminating against a protected group.  Jacobs and colleagues investigate a common statistical method, known as “significance testing,” which is often used in courts to demonstrate evidence of adverse impact.  Significance testing compares the difference between the proportion of majority candidates selected and the proportion of protected class candidates selected in an employment decision.  If the test finds the difference between these proportions to be “statistically significant,” courts generally interpret this to mean that adverse impact is present.

This method seems to make sense from a high level, but problems arise when you look under the surface.  The outcome of significance testing is greatly influenced by the number of people included in the analysis.  Specifically, the more people you include in a significance test, the greater the likelihood of finding a statistically significant difference between groups.  So a large organization with many people included in the analysis is much more likely to yield a significant difference between majority and protected groups than a smaller organization with fewer people in the exact same analysis.  This is the primary argument of the authors: why do courts use significance testing to demonstrate adverse impact when, by nature of the test, the results would almost always find that big organizations are discriminating and smaller ones are not?
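The authors' point is easy to demonstrate. Below is a minimal sketch of the pooled two-proportion z-test commonly used in these comparisons; the 50% vs. 40% selection rates are hypothetical, and only the headcount changes between runs:

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(p1, p2, n1, n2):
    """Pooled two-proportion z-test comparing two groups' selection rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed
    return z, p_value

# Identical selection rates (50% vs. 40%); only the headcount changes.
for n in (100, 500):
    z, p = two_prop_z(0.50, 0.40, n, n)
    print(f"n per group = {n}: z = {z:.2f}, p = {p:.4f}")
```

With these inputs, the n = 100 comparison is not significant (p ≈ .16) while the n = 500 comparison is (p ≈ .001), even though the gap in selection rates is identical.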

The authors conducted a series of studies to determine sources of differences in adverse impact significance testing.  They found that the number of people included in the analysis was the strongest predictor of whether or not a statistically significant difference was found between groups.  Sample size accounted for 49% of the final outcome of the analysis, almost five times more than any other factor (e.g., score differences on the assessment in question, proportion of each group selected, etc.).  They also discovered an interesting threshold: when an adverse impact significance test is conducted with 500 or more people in the analysis, very small differences between the groups’ selection proportions will be statistically significant (yet below 500 these same comparisons would not be significantly different).

These findings support the powerful impact of sample size on determinations of adverse impact via significance testing, but they do not tell us if members of majority and protected class groups are really experiencing systemic, differential outcomes in employment practices.  Unless a statistical method can accurately assess the latter issue, it is meaningless.  This oversimplification leads us to believe that virtually all larger organizations are guilty of discrimination and virtually all smaller organizations are not.  This common practice in courts only serves to make small organizations feel impervious and invincible and leave large organizations running in fear.

The authors close by asserting that regulatory standards should always reflect current scientific knowledge, yet the Uniform Guidelines on Employee Selection Procedures still reflect the science of decades past.  They advocate not only for alternative methods to more appropriately measure adverse impact, but also for a more dynamic definition of adverse impact: one that considers multiple, interactive factors before a determination can be made.  Current practice supports the message that to be big is to be bad and to be small is to be nice, which goes directly against the spirit of anti-discrimination legislation.

Jacobs, R., Murphy, K., & Silva, J. (2012, June).  Unintended consequences of EEO enforcement policies: Being big is worse than being bad.  Journal of Business and Psychology.

human resource management, organizational industrial psychology, organizational management


Using data to make smart decisions: 1 + 1 = It’s Not That Simple

Topic: Business Strategy, Decision Making, Evidence Based Management, Statistics
Publication: Harvard Business Review (APR 2012)
Article: Good Data Won’t Guarantee Good Decisions
Authors: S. Shah, A. Horne, and J. Capellá
Reviewed By: Megan Leasher
When we were in grade school, we learned that 1 + 1 = 2.  We quickly realized and celebrated the immediate success in figuring out what came after the equal sign.  This celebration built faith; blind faith that we should always believe in the result of an analysis.

But in business, it’s not quite so simple.  We should not automatically rejoice in what we see after an equal sign, because we need to judge what went into the numbers in the first place.  This concept is the focus of a study conducted by the Corporate Executive Board, which classified 5,000 employees at 22 global companies into one of three categories:  Those who always trust analysis over judgment, those who always rely on their gut, and those who balance analysis and judgment together.  The Board advocates the latter “balanced” group, as their research found that this group demonstrated higher productivity, effectiveness, market-share growth, and engagement than those in the other two groups.  However, the Board also found that only 38% of employees and 50% of senior executives fell into this “balanced” group.  Taken together, their findings advocate cultivating both analysis and judgment in decision-making at all levels of organizations.

The authors present several ideas as to how organizations can begin to shift toward a culture of applying appropriate insight and judgment to their data analysis.  First and foremost, they argue that data must be made accessible and presented in usable formats that enable analysis.  A dual focus must be placed on both the data and the judgment: increase data literacy and statistical expertise while simultaneously training employees how to correctly use the data, encouraging both dialogue and dissent throughout the interpretation.

But this is easier said than done.  You have to know what to trust and distrust in data.  You have to learn if and how metrics support the strategy and growth of an organization.  You have to learn what types of caveats and error can be found within the data.  You have to learn how the data was collected, what might be wrong with the collection process, and what important information might have been ignored.  You have to know how to interpret and proceed when you find that multiple metrics of performance are giving you competing answers; not all data play nice with each other.  You have to know what data is worth analyzing and what data should be abandoned altogether.  Sometimes running away screaming is the appropriate response.

Analysis isn’t just about writing a formula and clicking “run” or “execute” to crunch the numbers.  After all, data without method is just numbers in columns and rows.  Rather, analysis is about a series of critical, incremental, and ethical judgment calls before and after each iteration.  Some of the judgment calls come from understanding the content and context of the data, some come from a grounding in organizational and industry knowledge, and some come from an understanding of the past, present, and future strategy of the organization.  And yes, some judgment calls come from pure statistical knowledge.  The true expertise comes from a constant interplay and interdependence of all of these factors.

Regardless of the challenges presented, the authors are clear that decisions should never be made by data or one’s gut alone; analysis is critical, but so is applying corresponding judgment.

Shah, S., Horne, A., & Capellá, J. (2012, April).  Good data won’t guarantee good decisions.  Harvard Business Review, 23-25.

human resource management, organizational industrial psychology, organizational management


Internet-based Data Collection: Just Do It Already!

Topic: Measurement, Statistics
Publication: Computers in Human Behavior
Article: From paper to pixels: A comparison of paper and computer formats in psychological assessment.
Authors: M.J. Naus, L.M. Philipp, M. Samsi
Featured by: Benjamin Granger

Although many organizations have jumped onto the internet-data collection bandwagon, several issues still need to be addressed. For example, are paper-pencil and internet-based tests of the same trait (e.g., personality questionnaire) or ability (e.g., cognitive ability test) really equivalent? Similarly, are there any reasons to believe that employees respond to internet-based tests differently than they would a paper-pencil test of the same trait or ability?

Naus, Philipp, and Samsi (2008) set out to investigate these questions using three commonly used  psychological scales (Beck Depression Inventory, Short Form Health Survey, and the Neo-Five Factor Inventory).

Although Naus et al. found that the paper-pencil and internet-based survey formats performed equivalently for the Beck Depression Inventory and the Short Form Health Survey, there were differences for the Neo-Five Factor Inventory (a commonly used personality assessment tool). What’s going on here?

One possibility is that responses were more socially desirable for the paper-pencil format, since a researcher was present at the time. That is, in the presence of an authority figure (i.e.,  researcher) participants may have responded in order to appear more self-controlled and self-focused. This is likely much less of a concern when completing the same survey on a computer at home (in PJs!).

Overall, respondents perceived the internet-based format to be convenient, user-friendly, comfortable, and secure (all great things!). So what can we conclude from these findings? Although internet-based data collection methods have some advantages over paper-pencil methods, there are some caveats to their use. In some cases, the tests may operate differently due to the particular format. Unfortunately, not much is known about how they might differ. However, Naus et al.’s findings suggest internet-based methods receive good reactions from employees and can save an organization time and money!

Naus, M.J., Philipp, L.M., & Samsi, M. (2008). From paper to pixels: A comparison of paper and computer formats in psychological assessment. Computers in Human Behavior, 25, 1-7.

Is interrater correlation really a proper measurement of reliability?

Topic: Measurement, Research Methodology, Statistics
Publication: Human Performance
Article: Exploring the relationship between interrater correlations and validity of peer ratings
Blogger: Rob Stilson

Interrater reliability (still with me? OK, good) is often used as the main reliability estimate for the correction of validity coefficients when the criterion is job performance. Issues arise with this practice when one considers that the errors present between raters may not be random, but due to bias, while agreement between raters may also stem from bias instead of actual consistency. In this study, the authors’ main goal was to explore the relationship between interrater correlations and validity, and also the relationship between the number of raters and validity.

In order to do this, the authors gathered information from 3072 Israeli policemen from 281 work teams who took part in peer rating. These work teams averaged about 12 people and ranged from 5 all the way to 33. The measure used was overall performance (on a 7-point Likert scale). The predictor employed in this study was the ICC(C,k) model, which is equivalent to Cronbach’s alpha. Measurement indices were computed at the team level, as rating only took place within work teams.
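For readers unfamiliar with ICC(C,k): when raters are treated as the "items," it reduces to Cronbach's alpha computed over the rater columns. A minimal sketch with hypothetical ratings (not the study's data):

```python
from statistics import pvariance

def cronbach_alpha(ratings):
    """Cronbach's alpha, equivalent to ICC(C,k): rows = ratees, columns = raters."""
    k = len(ratings[0])  # number of raters
    rater_vars = [pvariance([row[j] for row in ratings]) for j in range(k)]
    total_var = pvariance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)

# Hypothetical: 4 ratees each rated by 3 raters on a 7-point scale.
ratings = [
    [6, 5, 6],
    [3, 4, 3],
    [5, 5, 6],
    [2, 3, 2],
]
print(round(cronbach_alpha(ratings), 3))
```

Note that a high alpha here only shows the raters agree; as the authors argue, shared bias inflates agreement just as well as genuine consistency does.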

The predicted variable for the study was the validity coefficient for each work team. This is the part of the study where you could really feel the sweat involved. Here the authors gathered information on supervisor evaluations, absenteeism data, and discipline data collected over several years (for over 3000 policemen)! The authors then converted this information into z scores, with higher scores indicating better performance.

Results showed a weak positive linear relationship between interrater correlations and the various validity indexes. This is not what you want to hear if you are doing peer-rated performance evaluations. The authors stipulate that the correlation between raters is a conglomeration of factors having different theoretical relationships with validity (i.e., bias and other idiosyncrasies).

Practical implications from the information gleaned here include the adjustment of validity for attenuation. If the measurements used in the calculation include non-random error estimates, the ensuing calculations will be off. A positive finding for the work world was that validity in small units (fewer than 10 people) was about the same as in larger units. The authors believe this finding may be due to the level of observation opportunity, which is seemingly greater in smaller work units.
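The standard correction for attenuation divides the observed validity coefficient by the square root of the criterion reliability, so any non-random error baked into that reliability estimate propagates directly into the "corrected" value. A sketch with purely illustrative numbers (not taken from the study):

```python
from math import sqrt

def correct_for_attenuation(observed_r, criterion_reliability):
    """Classic disattenuation of a validity coefficient: r_xy / sqrt(r_yy)."""
    return observed_r / sqrt(criterion_reliability)

# Illustrative: observed validity of .25, with an interrater reliability of .52
# used as the criterion reliability estimate.
print(round(correct_for_attenuation(0.25, 0.52), 2))
```

If the .52 reflects shared rater bias rather than true reliability, the corrected coefficient overstates (or understates) the real validity, which is exactly the authors' concern.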

Kasten, R., & Nevo, B. (2008). Exploring the relationship between interrater correlations and validity of peer ratings. Human Performance, 21(2), 180-197.