• Users Online: 532
  • Home
  • Print this page
  • Email this page
Home About us Editorial board Search Ahead of print Current issue Archives Submit article Instructions Subscribe Contacts Login 


 
 Table of Contents  
ORIGINAL ARTICLE
Year : 2016  |  Volume : 3  |  Issue : 4  |  Page : 180-189

Sample size calculation to data analysis of a correlation study in Microsoft Excel® : A hands-on guide with example


1 Department of Physiology, Maharaja Krishna Chandra Gajapati Medical College, Ganjam, Odisha, India
2 Department of Physiology, Medical College and Hospital, Kolkata, West Bengal, India

Date of Submission13-Oct-2016
Date of Decision05-Dec-2016
Date of Acceptance15-Dec-2016
Date of Web Publication29-Dec-2016

Correspondence Address:
Himel Mondal
Department of Physiology, MKCG Medical College, Ganjam - 760 004, Odisha
India
Login to access the Email id

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/2348-8832.196896

Rights and Permissions
  Abstract 

Background and Aim: Investigators use correlation study to determine the extent to which two or more variables are related among a single group of people. One big question frequently encountered by investigators is How to calculate sample size? After data collection, it is essential to test the data statistically. Hence, we planned to provide hands-on guide for sample size calculation and statistical analysis of data in Microsoft Excel® 2010 commonly needed for a correlation study.
Methods: An example of a correlation study of body mass index and peak expiratory flow rate was taken. Its execution process from sample size calculation to statistical analysis of data has been explained step by step in Excel® 2010. All formula and functions were tested in Excel® 2007.
Conclusion: Excel® as a tool for statistical analysis of data would facilitate researchers to do all statistical analyses in a single file. Along with correlation study, analysis by t-test and ANOVA was also provided. With this discussion, new investigators would get a fair knowledge about sample size calculation and data analysis of correlation study in Excel® and can use it for their future studies.

Keywords: Correlation study, Excel 2010, sample size, statistical analysis, t-test


How to cite this article:
Mondal H, Mondal S. Sample size calculation to data analysis of a correlation study in Microsoft Excel® : A hands-on guide with example. Int J Clin Exp Physiol 2016;3:180-9

How to cite this URL:
Mondal H, Mondal S. Sample size calculation to data analysis of a correlation study in Microsoft Excel® : A hands-on guide with example. Int J Clin Exp Physiol [serial online] 2016 [cited 2019 Mar 20];3:180-9. Available from: http://www.ijcep.org/text.asp?2016/3/4/180/196896


  Introduction Top


Correlation study

Correlation study is used to determine the relationship between two or more continuous variables among a single group of people. A correlation has direction, and correlation coefficient (r) can be either positive or negative. Zero indicates no relationship between the two variables, and r = 1 or r = −1 indicate a perfect relationship. The strength can be anywhere between 0 and ±1. [1] Correlation gives the degree and direction of relationship between the two variables, whereas the regression analysis predicts the values of one variable on the basis of the other variable. When there is perfect correlation (r = +1 or −1), the two regression lines will coincide or become one straight line. [2]

Essential terminology

Null hypothesis


The null hypothesis states that there is no association between the predictor and outcome variables in the population.

Alternative hypothesis

The proposition that there is an association is called the alternative hypothesis. The alternative hypothesis cannot be tested directly; it is accepted by default if the test of statistical significance rejects the null hypothesis.

Type I error

A type I error (false positive) occurs if an investigator rejects a null hypothesis that is actually true in the population.

Type II error

Type II error (false negative) occurs if the investigator fails to reject a null hypothesis that is actually not true in the population [Table 1]. [3]
Table 1: Null hypothesis, Type I and Type II error

Click here to view


Effect size

Effect size is a quantitative reflection of the magnitude of some phenomenon that is used for the purpose of addressing a question of interest. The investigator should first try to find data from prior studies in related areas to estimate reasonable effect size. When data are not available, a small pilot study may be done. [4]

α: The significance level of a test; the probability of rejecting the null hypothesis when it is true (or the probability of making a type I error).

β: The probability of failing to reject the null hypothesis when it is false (or the probability of making type II error).

Confidence level

The probability than an estimate of population parameter is within certain specified limits of the true value; commonly denoted by "1 - α."

Power of test

The probability of correctly rejecting the null hypothesis when it is false; commonly denoted by "1 - β." [5]


  Materials and methods Top


Commonly, researchers save raw data in Excel® and do statistical analysis in other software. It may be beneficial for them if they can do statistical analysis in the same Excel® sheet. Hence, we have discussed about the procedure of statistical analysis of data in Excel® . Screenshots of Windows were captured by Greenshot 1.2.7 Build 2 software for Windows. Though the screenshots were taken while executing analysis in Excel® 2010, we have tested all formula and functions in Excel® 2007, and all functions and formulae were compatible with Excel® 2007.

Designing a correlation study

Our research question was - whether body mass index (BMI) is correlated with peak expiratory flow rate (PEFR)? Similar previous studies with correlation of BMI and PEFR were available and had been studied beforehand. Among those studies, one study was almost similar regarding the age, race, and other characteristics of participants of our aimed study. That study found a correlation coefficient (r) = 0.30. Hence, we had anticipated in our study that BMI would be correlated with PEFR as that study.

Our null hypothesis

There is no correlation between BMI and PEFR in the study population.

Alternative hypothesis

There is a correlation between BMI and PEFR in the study population.

α: (two sided) we needed only 5% chance of rejecting null hypothesis when it is true. Hence, "α" value was 0.05. Confidence level was 1 − 0.05 = 0.95 or 95%.

β: We needed 5% probability of failing to reject null hypothesis when it is false. Hence, "β" value was 0.05 and power of test was 1 − 0.05 = 0.95 or 95%.

Effect size

From the previous study, we found correlation coefficient (r) = 0.30 between BMI and PEFR. Hence, our effect size was \0.30\ = 0.30.

The formula for sample size calculation for correlation study is as follows: [6]

N = ([Zα + Zβ]/C) 2 + 3

where, N = number of subjects required

Zα = the standard deviation for α

Zβ = the standard deviation for β

C = 0.5*ln ([1 + r]/[1 − r])

r = expected correlation coefficient

Sample size calculator

A new Excel® sheet was opened. Three cells were selected for texts - "α=," "β=," and "expected r =." Immediate right adjacent cells were colored for easy identification. We used those cells for inserting values for α, β, and r. On the right side of those cells, a cell text was written as "Zα=," and the adjacent right cells' formula was written as "=NORMSINV((1−(cell address of α/2)))." In our example, it was "=NORMSINV((1− (B4/2)))" [Figure 1]. As "α" was "two sided," we had divided it by 2. In similar fashion, "Zβ=" cells were selected and the formula was written as "=NORMSINV((1 − cell address of β))." In the example, it was "=NORMSINV((1 − B6))." Another cell text was written as "C=" and the adjacent right cells' formula was written as "=0.5*LN ((1 + cell address of expected r)/(1 − cell address of expected r))." In our example data sheet, it was "=0.5*LN ((1 + B8)/(1 − B8))" [Figure 2]. For the cell right side to "Sample size=," formula was written as "= (((cell address of Zα + cell address of Zβ)/cell address of C)^2) +3." In the said example, it was "= (((E4 + E6)/E8)^2) +3" [Figure 3]. For making the number as an integer, the formula is "=ROUND (cell address of sample size, 0)." In the example of [Figure 3], it was "=ROUND (E11, 0)." With this calculator, after inputting values of α, β, and r, we found that the number of participants required for our study was 139.
Figure 1: Screenshot showing calculation of Zƒ¿ in cell E4 with the value of cell B4 in Excel® 2010

Click here to view
Figure 2: Calculation of C in cell E8 with the value of cell B8

Click here to view
Figure 3: Calculation of sample size for correlation study in cell E11 with the calculated value of cells E4, E6, and E8 based on the equation: N = ([Zƒ¿ + ZƒÀ]/C) 2 +3

Click here to view


This calculator can be saved and used for sample size calculation of any correlation study with the effect size being calculated by correlation coefficient (r). Only α, β, and r values are required in specified cells, and the sample size will be calculated according to the formula we used.

Data entry

We had taken the following data (1) height in cm to the nearest 0.1 cm, (2) weight in kg to the nearest 0.1 kg, and (3) PEFR in L/min. In an Excel® sheet, the 1 st column (column B) label was written as "Weight (kg)," in first row and in the first row, the 2 nd column label was written as "Height (cm)," the 3 rd as BMI, and the 4 th as "PEFR (L/min)," [Figure 4] and the data were entered accordingly. For ease of description, only eight sets of data were entered.
Figure 4: Calculation of body mass index in cell D2 with the value of cells B2 and C2 on the basis of Quetelet's equation: body mass index = Weight in kg/height in m2

Click here to view


Calculating body mass index, mean, standard deviation and standard error

For calculation of BMI according to Quetelet's equation, [7] we had to convert the "Height in cm" to "Height in meter." A cell below the "BMI" cell was selected and the formula was written as "=cell address of weight in kg/(cell address of height (cm)/100)^2." In the example of [Figure 4], it was "=B2/(C2/100)^2."

Arithmetic mean is the average. Hence, for the calculation of mean of the data of the column B, the cell equation was written as "=AVERAGE (starting cell address of column B: last cell address of column B)." For cell B10, it was "=AVERAGE (B2:B9)" [Figure 5]. For calculation of standard deviation (SD), the cell equation used with STDEV function was "=STDEV (starting cell address of column: Last cell address of column)." Here, it was "=STDEV (B2:B9)" in cell B11. For calculation of standard error (SE), we have to use the formula SE = SD√N, [1] where N is the sample size. If we use this formula in Excel® for the data of column B in the example of [Figure 5], it will be "=STDEV (B2:B9)/SQRT ((COUNT (B2:B9)))." We can do it more simply as we already calculated the SD and we know the N. We can obtain the SE by writing a cell formula as "=B11/SQRT(8)" also.
Figure 5: Calculation of mean of values of cell B2 through B9 in the cell B10 with the "AVERAGE" function

Click here to view


Calculation of Pearson's correlation coefficient (r)

For calculation of Pearson's correlation coefficient (r), we had written a cell text as "r=," and the formula for adjacent right cell was written with PEARSON function as: "=PEARSON (first cell address of first column: Last cell address of first column, first cell address of second column: Last cell address of second column)." Here, it was "=PEARSON (D2:D9, E2:E9)." In the example of [Figure 6], the value for "r" was calculated in cell H2 and the value was 0.675689.
Figure 6: Calculation of Pearson's correlation coefficient (r) of body mass index (column D) and peak expiratory flow rate (column E) data in cell H2 with the "PEARSON" function

Click here to view


Alternatively, we could use the "Data Analysis Tools" to get the correlation coefficient. For correlation analysis, we have to follow the steps: Data tab (on ribbon)&$8594; Data Analysis Tools (on the extreme right side under data menu)&$8594;Correlation&$8594;Ok&$8594; Input range ($D$2:$E$9 for the example) and output range (any cell where the result to be shown, e.g., $A$13) &$8594; Ok.

Under "Data" tab (on ribbon), if "Data Analysis Tools" is not shown, we have to follow the steps: File &$8594; Option &$8594; Add Ins &$8594; Select "Analysis ToolPak" &$8594; Click on "Go" &$8594; Click "Ok" in new window [Figure 7]. Now, we have enabled data analysis tool for Excel® . [8]
Figure 7: Serial screenshots showing from left to right: After clicking "File," "Options" to be clicked; in "Add‑Ins," "Analysis ToolPak" is to be selected and "Go" button to be clicked; "Analysis ToolPak" is ticked and "Ok" is to be clicked

Click here to view


Calculation of "P value"

We needed to test whether the result was statistically significant or not. For calculation of "P value" from the "r" value, we had used the formula "=TDIST ((r*SQRT (n-2))/(SQRT (1 − r^2)), n − 2,2)" for two-tailed "P value." Calculation of "P0" from "r" is not allowed in Excel® , so we had incorporated conversion of "r0" to "t" in the Excel® formula according to the conversion formula: T = r(n − 2)/√(1 − r2 ) [1]

where r = correlation coefficient, n = number of participants, SQRT = square root function as written in Excel® , last 2 in formula was for 2 tail P.

In our example, n = 8 and r = 0.675689, we had not put the value of "r," we had written the cell number of "r0" instead. In the example of [Figure 8], it was H2. We had written a cell text as "P value=," and on the right adjacent cell, we had written the formula as "=TDIST((H2*SQRT (6))/(SQRT (1-H2^2)), 6, 2)." Calculated two-tailed "P value" for the example data was 0.065879 in cell H4.
Figure 8: Calculation of two‑tailed "P value" in cell H4 with the help of calculated value of "r" of cell H2 with the Excel® function "TDIST"

Click here to view


Solution for problems with negative "r" value: Where the "r0" value is negative, we have to convert it to its absolute value of number (ignoring the negative sign) before using it in the calculation of P from r. To do this in any cell with ABS function, we have to write the formula for the cell as "=ABS (cell address of r)." This will show the absolute value in the selected cell.

Solution for scientific expression of number: When the value of "P" is far less than zero (frequently encountered), the number in Excel® is expressed in scientific format (e.g., 4.567E-23). To show it in decimal format, we have to right click the cell where the number is present and have to select "Format Cells…." Under "Number" tab, we have to select "Number" (below "General") and increase the "Decimal places" until the number is shown after zeros in "Sample" and have to click "Ok." If the cell width is automatically increased to accommodate the number, it is ok. Otherwise, increase the width manually to view the number.

Regression analysis

In experimental science, after having understood the correlation between two variables, there are situations when it is necessary to do regression analysis. [1] For regression analysis, we had selected "Data&$8594; Data Analysis Tools&$8594;Regression&$8594;Ok&$8594; Input range and output range &$8594; Ok."

In the example, we put "BMI" in "X range" and "PEFR" in "Y range." For that, we had selected the "BMI" data column in "X range" as: "$D$2:$D$9" and "PEFR" data column in "Y range" as: "$E$2:$E$9" [Figure 9]. We could change the "Confidence Level," but it was 95% by default, so we kept it. We selected "Output range" in cell I1. The regression analysis result is shown in [Figure 10]. The equation y = α + βx is defined as the linear regression equation, where α is the intercept and β is the regression coefficient.
Figure 9: Screenshot during regression analysis of data of column D (Input Range X) and column E (Input Range Y) with 95% confidence interval. Output range is on cell I1

Click here to view
Figure 10: Summary output of regression analysis. Important parameters are colored. Equation: Y =145.75 + 3.972*X can be written from the data of J17 and J18

Click here to view


In the example, we found: R2 = 0.4566; SE: 16.956; F: 5.041; significance of F: 0.0659; and the equation was "Y = 145.75 + 3.972*X."

The Student's t-test

The t-test is commonly used to determine whether the mean value of a continuous outcome variable in one group differs significantly from that in another group. [9] The t-test was not suitable for the study example we were discussing, but learning t-test procedure is a worth learning for any statistical analysis. Procedure is same as any "Data Analysis" we discussed above. We just need to select "t-test" from scrolling down after selecting "Data Analysis Tools" and input the range of data column in "Variable 1 Range" and "Variable 2 Range." In the example shown in [Figure 11], we wanted to carry out t-test for BMI of male (column A) and female (column B). We input "$A$2:$A$9" in "Variable 1 Range" and "$B$2:$B$9" in "Variable 2 Range." We have selected "$D$3" as output range. Values for t-Stat and P have been highlighted with color in the result of t-test in [Figure 11].
Figure 11: Screenshot of result of t‑test of column A and column B by using Data Analysis Tool (output range is D3). Calculation of only two‑tailed "P value" by Excel® function "TTEST" is shown in cell E1

Click here to view


If we want to get only "P value" without other data analysis parameters, we can use the Excel® function "TTEST" and write the formula as "=TTEST (A2:A9, B2:B9, 2, 2)" as shown in the example provided in [Figure 11]. Here, the "P value" has been calculated in cell E1. See the formula; after the array 1 and array 2, the first "2" is for two-tailed test and the second "2" is for t-test assuming two samples of equal variance. We can change it according to our choice. If it is one tail and assuming two samples with unequal variance, the formula would be "=TTEST (A2:A9, B2:B9, 1, 3)."

ANOVA

In the example shown in [Figure 12], we have systolic blood pressure data of three groups of patients of underweight, normal weight, and overweight. We have to use the Data Analysis Tools for ANOVA test. Here, we have selected the "ANOVA: Single Factor" and selected input range as "$A:$C" and output range as $E$1. As our first row contained labels, we had ticked the "Labels in the first row." Result is shown on the sheet as shown in [Figure 12]. This example is for one-way ANOVA. For two-way ANOVA, we have to select "ANOVA: Two-Factor with Replication." For two-way ANOVA, we have to tell Excel® about the "Rows per sample." Rest of the procedure is the same.
Figure 12: Analysis result of one‑way ANOVA for the data of columns A, B, and C (output range is E1)

Click here to view


Regression scatter plot

Graphical representation of regression analysis data gives a better understanding of the trend. In the example of [Figure 13], we have shown a scatter plot of regression analysis of column A (BMI) and column B (Fat%) data. For inserting the scatter plot on Excel® sheet, we have to click "Insert" tab on ribbon and in "Charts" section, "Scatter" to be clicked and "Scatter with only Markers" is to be selected. A blank graph plot will appear on the excel sheet. "Right click" on any area of the blank plot will give us option to "Select Data…." Clicking on "Select Data" will lead us to "Chart Data Range" selection. Column A and Column B have to be selected (Press and hold "Ctrl" and click on A and B column), and the "Chart Data Range" will appear as "=Sheet1!$A:$A, Sheet1!$B:$B." Alternatively, we can input the range as "=Sheet1!$A$1:$B$9" for the data in the example. Clicking on "Ok" will show the graph. Now, we have to right click on any series point (blue diamond) and select "Add Trend line…" and in "Trend line option," "Linear" is selected; "Display equation on chart" and "Display R-squared value on chart" are selected and the window is closed. Regression scatter plot with trend line will be shown with equation and R2 value. In this example of data [Figure 13], BMI is on X axis and Fat% is on Y axis. The linear relation equation is "Y = 1.5948*X − 10.408."
Figure 13: Regression scatter plot for column A and B data, body mass index is on X axis and Fat% is on Y axis

Click here to view



  Discussion Top


Adequate sample size has paramount importance in assessing the accurate results, while less or more than required sample size has many disadvantages. [10] Sample size must be calculated during the planning stage of the study. Moreover, it should be done with reference to previous studies or a pilot study. [11] There are websites which help in the calculation of sample size [12],[13] free of cost. However, the commonly used software for statistical analysis such as IBM SPSS Statistics (New Orchard Road Armonk, New York, USA) [14] and GraphPad Prism (GraphPad Software, Inc., CA, USA) [15] are paid software. It may be difficult for researchers in small settings to get that software. On the other hand, Microsoft Office® is widely used software in educational institutions and offices. Excel® is a part of Microsoft Office® software package. Knowledge of Function, Formula, and Data Analysis Tools in Excel® is necessary for the analysis of data. We have discussed a simple example of a correlation study for better understanding of the procedure of a correlation study.


  Conclusion Top


This article would help new researchers to calculate the sample size for correlation study and statistical analysis without using any additional statistics software. They get hands-on experience on making sample size calculator, calculation and analysis of data in Excel® . Additional description of t-test and ANOVA would help them in data analysis of other studies. They can utilize this article as a reference tutorial for future studies.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

 
  References Top

1.
Khanal AB. Correlation and regression. In: Mahajan′s Methods in Biostatistics for Medical Students and Research Workers. 8 th ed. New Delhi, India: Jaypee Brothers Medical Publishers (P); 2016. p. 222-31.  Back to cited text no. 1
    
2.
Machin D, Campbell MJ, Walters SJ. Correlation, linear and logistic regression. In: Medical Statistics: A Textbook for the Health Sciences. 4 th ed. England: John Wiley & Sons Ltd.; 2007. p. 150.  Back to cited text no. 2
    
3.
Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. Getting ready to estimate sample size: Hypotheses and underlying principles. In: Designing Clinical Research. 3 rd ed. Philadelphia, PA: Lippincott Williams and Wilkins; 2001. p. 55-9.   Back to cited text no. 3
    
4.
Kelley K, Preacher KJ. On effect size. Psychol Methods 2012;17:137-52.  Back to cited text no. 4
    
5.
Lwanga SK, Lemeshow S. Definitions of commonly used terms. In: Sample Size Determination in Health Studies: A Practical Manual. Geneva: World Health Organization; 1991. p. 21-2.   Back to cited text no. 5
    
6.
Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med 2012;31:3972-81.  Back to cited text no. 6
    
7.
Eknoyan G. Adolphe Quetelet (1796-1874) - The average man and indices of obesity. Nephrol Dial Transplant 2008;23:47-51.  Back to cited text no. 7
    
8.
McFedries P. Analyzing excel data. In: Teach Yourself Visually TM Excel® 2010. Indiana: Wiley Publishing Inc.; 2010. p. 242-3.  Back to cited text no. 8
    
9.
Gravetter FJ, Wallnau LB. Introduction to the t statistic. In: Essentials of Statistics for the Behavioral Sciences. 8 th ed. USA: Cengage Learning; 2014. p. 252.  Back to cited text no. 9
    
10.
Kumar GS. Importance of sample size in clinical trials. Int J Clin Exp Physiol 2014;1:10-2.  Back to cited text no. 10
  Medknow Journal  
11.
Kim J, Seo BS. How to calculate sample size and why. Clin Orthop Surg 2013;5:235-42.  Back to cited text no. 11
    
12.
California: UCSF Clinical & Translational Science Institute. Available from: http://www.sample-size.net/correlation-sample-size/. [Last updated on 2016 Jun 02; Last cited on 2016 Sep 12].  Back to cited text no. 12
    
13.
Hong Kong: Centre for Clinical Research and Biostatistics, the Chinese University of Hong Kong. Available from: https://www2.ccrb.cuhk.edu.hk/stat/other/correlation.htm. [Last cited on 2016 Sep 12].  Back to cited text no. 13
    
14.
IBM SPSS Statistics for Windows, Version 22.0. Armonk, NY: IBM Corp.; 2013 Release. Available from: http://www.ibm.com/analytics/us/en/technology/spss. [Last cited on 2016 Sep 13].  Back to cited text no. 14
    
15.
GraphPad Prism Version 6.00 for Windows. La Jolla California USA: GraphPad Software. Available from: http://www.graphpad.com. [Last cited on 2016 Sep 13].  Back to cited text no. 15
    


    Figures

  [Figure 1], [Figure 2], [Figure 3], [Figure 4], [Figure 5], [Figure 6], [Figure 7], [Figure 8], [Figure 9], [Figure 10], [Figure 11], [Figure 12], [Figure 13]
 
 
    Tables

  [Table 1]



 

Top
 
 
  Search
 
Similar in PUBMED
   Search Pubmed for
   Search in Google Scholar for
 Related articles
Access Statistics
Email Alert *
Add to My List *
* Registration required (free)

 
  In this article
   Abstract
  Introduction
   Materials and me...
  Discussion
  Conclusion
   References
   Article Figures
   Article Tables

 Article Access Statistics
    Viewed7371    
    Printed49    
    Emailed0    
    PDF Downloaded671    
    Comments [Add]    

Recommend this journal


[TAG2]
[TAG3]
[TAG4]