A Systematic Review of Statistical Methods Used to Test for Reliability of Medical Instruments Measuring Continuous Variables

Document Type: Original Article


1,3 Julius Centre University of Malaya, Department of Social & Preventive Medicine, Faculty of Medicine, University of Malaya, 50603, Kuala Lumpur, Malaysia

4 Department of Applied Statistics, Faculty of Economics & Administration, University of Malaya, 50603, Kuala Lumpur, Malaysia



Objective:
Reliability measures precision, that is, the extent to which test results can be replicated. This is the first systematic review to identify the statistical methods used to measure the reliability of equipment measuring continuous variables. This study also aims to highlight inappropriate statistical methods used in reliability analysis and their implications for medical practice.
Materials and Methods:
In 2010, five electronic databases were searched for reliability studies published between 2007 and 2009. A total of 5,795 titles were initially identified. Of these, only 282 were potentially related, and 42 finally met the inclusion criteria.
Results:
The Intra-class Correlation Coefficient (ICC) was the most popular method, used in 25 (60%) of the 42 studies, followed by comparison of means (8 studies, 19%). Of the 25 studies using the ICC, only 7 (28%) reported the confidence intervals and the type of ICC used. Most studies (71%) also tested the agreement of instruments.
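To illustrate why reporting the type of ICC matters, the following minimal Python sketch computes three single-measure forms of the ICC from the same ratings. It is not the analysis used in any of the reviewed studies. The function name `icc_single_measure` is hypothetical, the formulas follow the one-way and two-way ANOVA definitions of Shrout and Fleiss (Psychol Bull 1979), and the example ratings (6 subjects, 4 raters) are the worked example from that paper. Confidence intervals are not computed here.

```python
import numpy as np


def icc_single_measure(ratings):
    """Return ICC(1,1), ICC(2,1) and ICC(3,1) for a subjects x raters array.

    Hypothetical helper for illustration only; assumes complete data
    (every rater measured every subject once).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape  # n subjects (rows), k raters (columns)
    grand_mean = x.mean()

    # Sums of squares from the two-way ANOVA decomposition
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_subjects = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_within = ss_total - ss_subjects
    ss_error = ss_within - ss_raters

    # Mean squares
    bms = ss_subjects / (n - 1)            # between subjects
    wms = ss_within / (n * (k - 1))        # within subjects
    jms = ss_raters / (k - 1)              # between raters
    ems = ss_error / ((n - 1) * (k - 1))   # residual

    return {
        # One-way random effects
        "ICC(1,1)": (bms - wms) / (bms + (k - 1) * wms),
        # Two-way random effects, absolute agreement
        "ICC(2,1)": (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n),
        # Two-way mixed effects, consistency
        "ICC(3,1)": (bms - ems) / (bms + (k - 1) * ems),
    }


# Example ratings from Shrout and Fleiss (1979): 6 subjects rated by 4 raters
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

result = icc_single_measure(example)
for name, value in result.items():
    print(f"{name} = {value:.2f}")
```

On these ratings the three forms come out at roughly 0.17, 0.29 and 0.71. A reported "ICC" is therefore hard to interpret unless the study states which form was used.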
Conclusion:
This study finds that the Intra-class Correlation Coefficient is the most popular method used to assess the reliability of medical instruments measuring continuous outcomes. Some studies also apply and interpret statistical methods inappropriately. It is important for medical researchers to be aware of this issue and to be able to perform reliability analyses correctly.

