Reliability Of VO2 Max Measurements Physical Education Essay

Maximal oxygen intake (VO2max) is the highest rate at which oxygen can be absorbed and used by the body during exercise. It represents the upper limit of exercise tolerance and is commonly used to demonstrate the physiological effects of training. Accurate monitoring of VO2max is essential for tracking an athlete’s aerobic fitness, and reliable procedures have been established by sporting authorities such as the British Association of Sport and Exercise Sciences (BASES). This report first discusses the historical development and technical measurement of VO2max, then examines the procedure and statistical methods used in this study, then explores the limitations and errors of the experiment, and finally analyses the ability of the experiment to report a meaningful change in measurement.

Historical Development

Experiments with VO2max began in 1923 when Hill and Lupton measured their oxygen uptake while running at different velocities on a grass track. They discovered that oxygen intake increases with running speed up to a critical speed, beyond which no further increase occurs because the cardiorespiratory system has reached its maximum capacity. Beyond this critical speed, the body requires more oxygen than it can process and an oxygen debt occurs. Fatigue, exhaustion and lactic acid accumulation are also associated with oxygen debt. Such information is important for monitoring an athlete’s training performance and for planning exercise that improves aerobic fitness. In the laboratory, experiments use various ergometers to induce exercise and spirometers to measure the volume of expired air and the concentrations of gases in it. The fractional concentrations of carbon dioxide (CO2) and oxygen (O2) in the expired air are then used with the Fick equation to determine VO2.

Originally, Douglas bags attached to a volumetric gas analyser such as the Tissot gasometer were used to measure gases. The Douglas bag is considered the criterion measure for spirometry. However, these heavy, awkward bags made it difficult to time the capture of gases and were subject to diffusion errors (Shepard 1955). The next improvement came in the form of a meteorological balloon attached to electronic gas analysers, which eventually became a semi-automated system using three bags that had to be rotated manually on a spinner. However, whereas the measurement of ventilation was immediate, analysis of the expired gases was delayed by the movement of gas through the breathing tubes and by the analyser’s response time (Davis 1995). This misalignment of times can be overlooked if the concentrations of expired gas are constant, or during maximal exercise, when the large volumes of expired air make the temporal differences small.

Present-day analysis uses a breath-by-breath system, a fully automated system which measures expired air continuously and simultaneously calculates CO2 and O2 concentrations. Gases are analysed by mass spectrometry or infrared light absorption and the results are displayed directly on a monitor. This method is convenient: the observer does not need to perform manual calculations or rotate spinners for gas sampling, and the misalignment of times in the gas analysis is eliminated. Properly calibrated spirometers can have technical errors as low as 10% of the total error of measurement (Katch 1982). At steady state, VO2 is reproducible even at levels as high as 3.0-4.0 L.min-1, with small 1 min variations of 0.1 L.min-1 (Howley et al 1995). These same researchers found coefficients of variation <3% for repeated measurements on participants performing steady-state sub-maximal exercise. According to both Katch (1982) and Howley et al (1995), biological differences, the inability of the participant to maintain the prescribed work rate and incorrectly set work rates account for the majority of errors.

The most commonly used ergometers are the treadmill and the cycle ergometer, which engage a substantial proportion of the large muscles and so allow the body’s total aerobic capacity to be estimated (Shepard 1992); sport-specific ergometers for rowing and kayaking have also been designed (Gore 2000). Treadmills have adjustable speed and elevation to increase the exercise intensity, whilst cycle ergometers vary the power output by suspending known weights. The exercise intensities are easily calculated from the speed, elevation or loads used. VO2max values measured on the treadmill are approximately 10% higher than on the cycle ergometer (Davis and Katch 1975) and, unlike cycling, running is not affected by poor coordination, inexperience or localised muscle fatigue in the thighs. However, treadmill running is sensitive to excess movement; for example, holding onto the hand railing reduces the work performed by the participant (Wasserman 2005). Cycling has advantages of its own: injuries are less likely because of the participant’s positioning, participants are able to stop independently at exhaustion and it is easier to collect blood samples.

Types of Protocols

There is a wide variety of recognised exercise protocols to determine VO2max, which provides flexibility depending on the limitations present. The tests vary in the mode of exercise (running or cycling, as discussed earlier), maximal vs. sub-maximal effort, continuous vs. discontinuous methods, the duration of exercise and the increments between exercise stages. In addition, there are certain assumptions in the calculations of which observers must be aware.

Billat and Koralsztein (1996) and Shepard (1984 & 1992) recommend the direct measurement of VO2 in a maximal test as the criterion measure. A maximal test is characterised by a VO2 plateau despite increments of work. These methods require careful calibration and test procedure to be accurate. However, participants who suffer from cardiorespiratory disease or musculoskeletal disability are advised not to undertake intensive exercise, and in such situations a sub-maximal test is used instead. Sub-maximal testing assumes a linear relationship between heart rate and oxygen consumption and is able to predict VO2max when systematic errors are below 10% and random errors are less than ±10% (Shepard 1984). Another assumption is that the participants in a sub-maximal test have a constant running economy at all exercise intensities (Heyward 2010). A runner who has poor running economy will produce a higher sub-maximal heart rate and the VO2max will be overestimated.
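The linear assumption behind sub-maximal prediction can be illustrated with a minimal sketch. It fits a least-squares line through heart rate and VO2 pairs and extrapolates it to the age-predicted maximum heart rate (220-age). The function name and the stage values are hypothetical and are not data from this study.

```python
def predict_vo2max(heart_rates, vo2_values, age):
    """Extrapolate a least-squares HR-VO2 line to the age-predicted maximum HR."""
    n = len(heart_rates)
    mean_hr = sum(heart_rates) / n
    mean_vo2 = sum(vo2_values) / n
    # Least-squares slope and intercept of VO2 on heart rate
    sxy = sum((h - mean_hr) * (v - mean_vo2) for h, v in zip(heart_rates, vo2_values))
    sxx = sum((h - mean_hr) ** 2 for h in heart_rates)
    slope = sxy / sxx
    intercept = mean_vo2 - slope * mean_hr
    hr_max = 220 - age  # age-predicted maximum heart rate (b.min-1)
    return slope * hr_max + intercept


# Hypothetical sub-maximal stages: heart rate (b.min-1) and VO2 (mL.kg-1.min-1)
stage_hr = [120, 140, 160]
stage_vo2 = [25.0, 33.0, 41.0]
estimate = predict_vo2max(stage_hr, stage_vo2, age=23)
print(round(estimate, 1))
```

Any departure from linearity, or a true maximum heart rate that differs from 220-age, feeds directly into the estimate, which is why the error bounds quoted above matter.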

A common variation is the continuity of the test. A continuous test, such as the graded exercise test (GXT), is conducted with no rest between work increments, whereas a discontinuous test allows rest periods of 5-10 min between work intensities. In both tests, the workload is progressively increased until exhaustion is attained. Maksud and Coutts (1971) found no difference in VO2 measurements between continuous and discontinuous tests. As the discontinuous test takes up to 5 times longer and produces similar results, a continuous test is often preferred.

Increments in ramp protocols are continuous and frequent (every 10 or 20 s), causing a linear increase of VO2 with work rate. This linear relationship aids the calculation of VO2max, and some individuals are able to reach higher exercise intensities than with the GXT method. In contrast, the GXT increases workload every 2-3 min and its increments are based on the protocol, not on individual differences. The disadvantage of the ramp protocol is that, although the test can be individualised to different participant profiles, the predetermined maximum work rate must be estimated from previous records or questionnaires and the increments computed so that all of them are completed in 10 min (Heyward 2010). Also, frequent increases of work require expensive ergometers that are able to make frequent and rapid work increments.
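As a rough illustration of how ramp increments might be individualised, the sketch below divides the gap between a starting work rate and a predicted maximum into equal steps that fit within 10 min. The function name, the use of watts and the example values are assumptions made for illustration only.

```python
def ramp_increment(predicted_max, start, duration_s=600, step_s=20):
    """Work-rate increase per step so that the ramp ends at predicted_max."""
    steps = duration_s // step_s  # e.g. 30 steps of 20 s in 10 min
    return (predicted_max - start) / steps


# Hypothetical participant: predicted maximum 300 W, starting at 60 W
increment = ramp_increment(predicted_max=300, start=60)
print(increment)
```

With 10 s steps (step_s=10) the same ramp needs twice as many, smaller increments, which is where the demand for a rapidly adjustable ergometer arises.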

The American College of Sports Medicine (ACSM) recommends a total VO2max testing duration of 8-12 min (Franklin et al 2010) to enhance the chances of the participant achieving VO2max. Tests which are brief and have rapid increments may not allow sufficient data to be collected, and tests which are long and have small increments may end prematurely due to monotony and exercise discomfort.

In response to the above considerations, and taking into account the participants’ profiles and the laboratory facilities available for this reliability study, a maximal continuous GXT no longer than 12 min was selected.

Determination of VO2max

The commonly agreed primary criterion of VO2max is a plateau in VO2 despite increasing workload. Taylor et al (1955) recommended a plateau criterion of <2.1 mL.kg-1.min-1 for an increase of 2.5% gradient at 11.3 km.h-1. At times, this plateau may not be evident during a GXT even though VO2max has been reached, or VO2 may continue to rise linearly with workload in a ramp protocol. In such situations, three secondary criteria are added to verify the determination of VO2max, but they should not be used independently of the primary criterion. The first supporting requirement is a blood lactate concentration of ≥8 mmol.L-1 in the first 5 min after exercise, the second is a respiratory exchange ratio (RER) of ≥1.00 at the end of the test and the third is a heart rate (HR) at the end of the test of ≥90% of the age-predicted maximum (220-age) (Gore 2000).
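The criteria above can be expressed as a simple check. The sketch below is illustrative only: it treats each secondary threshold as "at or above" the stated value, reports the primary and secondary criteria separately so that the secondary ones are never used on their own, and uses hypothetical end-of-test values.

```python
def vo2max_criteria(vo2_rise, lactate, rer, hr_end, age):
    """Return (plateau_met, secondary_checks) for one maximal test.

    vo2_rise: increase in VO2 over the last workload step (mL.kg-1.min-1)
    lactate:  blood lactate in the first 5 min after exercise (mmol.L-1)
    rer:      respiratory exchange ratio at the end of the test
    hr_end:   heart rate at the end of the test (b.min-1)
    """
    plateau_met = vo2_rise < 2.1          # primary criterion (Taylor et al 1955)
    hr_threshold = 0.90 * (220 - age)     # 90% of age-predicted maximum
    secondary_checks = {
        "lactate": lactate >= 8.0,
        "rer": rer >= 1.00,
        "heart_rate": hr_end >= hr_threshold,
    }
    return plateau_met, secondary_checks


# Hypothetical end-of-test values, not data from this study
plateau, secondary = vo2max_criteria(vo2_rise=1.5, lactate=9.2, rer=1.12, hr_end=190, age=23)
print(plateau, secondary)
```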

Test Subjects

Six participants (2 males, 4 females) aged 21-25 years, of average to above-average fitness, were recruited from the university population. Prior to the test, the test procedures were explained and participants were familiarised with the equipment; participants also gave written consent and completed a pre-test health questionnaire. Due to time constraints and failure to derive satisfactory values, only 3 test/re-test results were achieved.

Before each experiment, it was ensured that participants had avoided heavy meals, intensive exercise, alcohol, tobacco, caffeine and other drugs, and that they were well rested and hydrated. Tests were also conducted at the same time of day to prevent circadian variability, and participants wore the same clothes and shoes for each test.

Test Procedure

The protocol chosen for this reliability study was a progressive maximal test consisting of two phases. In both phases, the participant wears a nose-clip and a mouthpiece attached to a breath-by-breath analyser to measure VO2, and workload is determined by the speed and gradient of the treadmill. During the tests, the laboratory temperature was between 18 and 23°C with a relative humidity <70%, as recommended by Gore (2000), and the laboratory was controlled for noise and distractions. An electric fan was used to prevent participants from suffering heat exhaustion.

The first phase is a multi-stage incremental test to determine oxygen uptake, heart rate and blood lactate responses over a range of treadmill speeds within the participant’s capabilities. The participant exercises for 3 min at each of a series of escalating workloads, which are separated by 1 min of rest. The initial workload is set at a 1% gradient and 8-10 km.h-1 for women or 10-12 km.h-1 for men, with the speed increasing by 1 km.h-1 at each stage. It is recommended that a minimum of five and a maximum of nine stages are completed. The average heart rate and rating of perceived exertion (RPE) are recorded in the last 30 s of each stage. On completing each stage, the participant stands astride the moving treadmill belt and blood is sampled from the fingertip within 15-30 s to measure lactate concentration.

During this time, the speed of the treadmill is raised by 1 km.h-1 and the participant resumes exercise after the 1 min of rest. This continues until the participant is close to, but not at, exhaustion, quantified by a blood lactate level ≥4 mmol.L-1 and a HR within 5-10 b.min-1 of the age-predicted maximum. The participant then rests for 15 min before the second part of the test begins.

This final phase is a maximal test which measures VO2max. The treadmill speed is set at 2 km.h-1 less than the final speed in the first phase, with an initial gradient of 1%. The participant gradually attains this speed from rest over 1-2 min. The speed is held constant throughout the final phase; instead, the gradient is raised by 1% every 1 min until voluntary exhaustion. Participants were then allowed to recover actively by walking on the treadmill and were observed for symptoms of fatigue for 30 min post-exercise.
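The two phases described above can be laid out as a schedule. The sketch below is an illustration of the stated timings only (3 min stages with 1 min rest and +1 km.h-1 per stage in phase one; constant speed and +1% gradient per minute in phase two); the function names and example speeds are hypothetical.

```python
def phase_one_stages(start_speed, n_stages):
    """List (stage, speed km.h-1, gradient %, start minute) for phase one."""
    if not 5 <= n_stages <= 9:
        raise ValueError("the protocol recommends between five and nine stages")
    # Each stage lasts 3 min and is followed by 1 min of rest, so stages start 4 min apart
    return [(i + 1, start_speed + i, 1, i * 4) for i in range(n_stages)]


def phase_two_gradients(final_phase_one_speed, minutes):
    """List (minute, speed km.h-1, gradient %) for the maximal phase."""
    speed = final_phase_one_speed - 2  # 2 km.h-1 below the last phase-one speed
    # Gradient starts at 1% and rises by 1% every minute
    return [(minute, speed, 1 + minute) for minute in range(minutes)]


# Hypothetical participant starting at 10 km.h-1 and completing five stages
stages = phase_one_stages(start_speed=10, n_stages=5)
maximal = phase_two_gradients(final_phase_one_speed=stages[-1][1], minutes=3)
print(stages)
print(maximal)
```

In practice the number of phase-one stages is not fixed in advance but is determined by the lactate and heart rate stopping rule, and the maximal phase ends at voluntary exhaustion rather than after a set number of minutes.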

The test was repeated for re-test reliability within a period of a month from the initial test. The above test was based on BASES testing protocols (Winter 2007).

Data Analysis
Limits of agreement (LoA)

Test 1    Test 2    Mean     Difference    Mean diff - 2 SD
53.8      66        59.9     12.2          -10.10417
51.2      49.3      50.25    -1.9          -10.10417
30.7      33.3      32       2.6           -10.10417

Mean of differences = 4.3

SD of differences = 7.2

Table 1. Calculations for Limits of Agreement. Units are in mL.kg-1.min-1

Figure 1. Limits of Agreement shown by the Bland-Altman plot for test-retest data. The dotted lines represent the 95% limits of agreement and the solid line represents the mean.

The Bland-Altman plot displays the reference interval within which 95% of test-retest differences in the sample population are expected to lie. It assumes that there are differences within the population during test-retest. Its drawbacks are that it is affected by measurement errors that grow as the magnitude of the measurements increases (heteroscedasticity) (Bland and Altman 2010) and that technical and systematic errors are not represented on the plot. The results show limits of agreement of 4.3 ± 14.4 mL.kg-1.min-1. For good repeatability, the mean difference should be zero as the same method was used (Bland and Altman 2010); this study’s mean difference of 4.3 therefore indicates poor reliability. The limits of agreement are wide and tolerate a high variability within individuals. This study is therefore not sensitive enough to detect small changes in performance, which could be due to the lack of normality in the test results (Hopkins 2000).
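The limits of agreement can be recomputed from the three test-retest pairs in Table 1. The sketch below assumes differences are taken as Test 2 minus Test 1 and uses the sample standard deviation (n-1) with a multiplier of 2, which reproduces the lower limit of -10.10417 listed in the table; the variable names are illustrative.

```python
from statistics import mean, stdev

# Test-retest VO2max values from Table 1 (mL.kg-1.min-1)
test1 = [53.8, 51.2, 30.7]
test2 = [66.0, 49.3, 33.3]

diffs = [b - a for a, b in zip(test1, test2)]  # Test 2 minus Test 1
bias = mean(diffs)                             # mean difference
sd_diff = stdev(diffs)                         # sample SD of the differences

lower = bias - 2 * sd_diff
upper = bias + 2 * sd_diff
print(round(bias, 1), round(sd_diff, 1), round(lower, 1), round(upper, 1))
```

With only three pairs, a single participant (the 12.2 difference) accounts for most of the width of the interval.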

Technical error of measurement (TEM)

Test 1    Test 2
53.8      66
51.2      49.3
30.7      33.3

No of tests = 6

mean

SD

TEM = √(ΣD²/2N)

D = difference between repeated measures

N = total number of measurements

TEM (68%) = 3.6

TEM (95%) = 3.6 x 2 = 7.2

Table 2. Calculations for technical error of measurement. Units are in mL.kg-1.min-1

According to the data, future measurements will have to differ by more than ±7.2 mL.kg-1.min-1 (95% confidence level) before it can be concluded that a real change has occurred and is not a result of measurement error (Gore 2000). The technical error of measurement requires a large number of measurements to obtain a normal distribution, which was not possible in this study and caused the wide confidence range.
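The TEM figures in Table 2 can be retraced with a short calculation. The sketch below follows the definitions given with the table, taking N as the total number of measurements (6); the variable names are illustrative.

```python
from math import sqrt

# Test-retest VO2max values from Table 2 (mL.kg-1.min-1)
test1 = [53.8, 51.2, 30.7]
test2 = [66.0, 49.3, 33.3]

sum_d2 = sum((b - a) ** 2 for a, b in zip(test1, test2))  # sum of squared differences
n_measurements = len(test1) + len(test2)                  # N = 6, as stated with the table

tem_68 = sqrt(sum_d2 / (2 * n_measurements))  # about 3.6
tem_95 = 2 * tem_68                           # about 7.3 before rounding
print(round(tem_68, 1), round(tem_95, 1))
```

Doubling the unrounded value gives about 7.3; the 7.2 reported in the table comes from doubling the rounded 3.6.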

Method error (ME)

ME = sd of difference/√2

Test 1    Test 2
53.8      66
51.2      49.3
30.7      33.3
45.2      49.5    (mean)

mean difference

st.dev. of difference

ME = 4.16

Mean of the two test-day means = 47.383333

CV (%) = 8.8

Table 3. Calculations for coefficient of variation. Units are in mL.kg-1.min-1

The coefficient of variation (CV) expresses error as a percentage of the mean and is a dimensionless coefficient which is easy to compare over a range of different tests. The CV, however, accounts for only 68% of variability (Currell and Jeukendrup 2008). CV assumes that the population is normal and that measurements are heteroscedastic. A CV <10% is desirable for reliability in laboratory analysis (Atkinson and Nevill 1998), although CVs <3% for repeated measurements of VO2max have been recorded (Howley et al 1995). This study meets the criterion with a CV of 8.8%; nevertheless, the assumption of normality will be difficult to fulfil due to the small sample size.
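The method error and CV can be retraced from the test-retest values in Table 3. In the sketch below, the reported ME of 4.16 is reproduced when the population standard deviation of the differences is used; the sample standard deviation (the 7.2 used for the limits of agreement) would give about 5.09 instead. Which form was intended is an assumption, so both are shown; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Test-retest VO2max values from Table 3 (mL.kg-1.min-1)
test1 = [53.8, 51.2, 30.7]
test2 = [66.0, 49.3, 33.3]
diffs = [b - a for a, b in zip(test1, test2)]

me_population = pstdev(diffs) / sqrt(2)  # population SD (divide by n): about 4.16
me_sample = stdev(diffs) / sqrt(2)       # sample SD (divide by n - 1): about 5.09

grand_mean = mean([mean(test1), mean(test2)])  # mean of the two test-day means
cv_percent = 100 * me_population / grand_mean  # about 8.8%
print(round(me_population, 2), round(me_sample, 2), round(cv_percent, 1))
```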

Correlation and regression

Test 1    Test 2    Mean     SD      CV
53.8      66        59.9     6.1     0.101836
51.2      49.3      50.25    0.95    0.018905
30.7      33.3      32       1.3     0.040625

Table 4. Calculations for correlation and regression. Units are in mL.kg-1.min-1

Figure 2. Plot of correlation

The gradient of 1.0865 indicates a slight bias towards higher values in the retest, and there is a systematic error of 1.7623 denoted by the y-intercept. The coefficient of determination R2 in this study is 0.9414, which approaches a perfect fit of R2=1 and demonstrates high repeatability (Atkinson and Nevill 1998). The highest measured value on the plot is also the furthest away from the x=y line and has a strong influence in creating the skew, which is also reflected in a large Standard Error of Measurement (SEM) of 2.49 mL.kg-1.min-1 in one of the participants.

Test 1    Test 2    (Test 1)²    (Test 2)²    (Test 1)(Test 2)
53.8      66        2894.44      4356         3550.8
51.2      49.3      2621.44      2430.49      2524.16
30.7      33.3      942.49       1108.89      1022.31
Sum                 6458.37      7895.38      7097.27

Pearson’s correlation, r = 0.91

Table 5. Calculations for Pearson’s correlation coefficient. Units are in mL.kg-1.min-1

Pearson’s correlation coefficient (PCC) is the extent to which two variables are related. In this study r ≥ 0.90, which shows a significant positive relationship between test and retest and indicates good repeatability (Atkinson and Nevill 1998). PCC was chosen over the intraclass correlation (ICC) because ICC is appropriate for tests with more than two trials. One disadvantage of PCC is that it cannot detect random or systematic changes in the mean values between test and retest. PCC is also influenced by inter-subject differences (heterogeneity), and hence by the sample, so it reflects the participants more than the test itself. With such a small sample, each individual in this study contributes significantly to the PCC.
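The sums of squares and cross-products in Table 5 are the inputs to the computational form of Pearson’s r. The sketch below assumes that form, r = (nΣxy - ΣxΣy) / √((nΣx² - (Σx)²)(nΣy² - (Σy)²)), with x as Test 1 and y as Test 2; the variable names are illustrative.

```python
from math import sqrt

# Test-retest VO2max values from Table 5 (mL.kg-1.min-1)
test1 = [53.8, 51.2, 30.7]
test2 = [66.0, 49.3, 33.3]
n = len(test1)

sum_x, sum_y = sum(test1), sum(test2)
sum_x2 = sum(x * x for x in test1)                # 6458.37 in Table 5
sum_y2 = sum(y * y for y in test2)                # 7895.38 in Table 5
sum_xy = sum(x * y for x, y in zip(test1, test2)) # 7097.27 in Table 5

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(round(r, 2))
```

Because r depends only on how the pairs line up relative to one another, a uniform shift in all retest values would leave it unchanged, which is the limitation noted above.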

Limitations

The greatest limitation of this reliability study was the small sample. The reasons for this were the short testing period and unexpected occurrences such as an injured participant and inaccurate readings due to saliva in the breathing tubes. Hence acceptable test-retest values were collected for only three participants. This makes the results sensitive to individual test performance and affects LoA, CV, R2 and PCC, which assume normality in the sample.

An additional factor is the time between test-retest trials. For some participants there was a gap of three weeks between tests. Hopkins et al (2001) noted that the CV between trials conducted on consecutive days was greater than that between trials conducted on the same day.

The profile of the test participants played an important role too. The participant who showed the greatest change between test and retest was a well-trained athlete who was tested over a two-week period, allowing training effects to influence his VO2max performance. Knowing his initial test results, he would also have been motivated by his competitive nature to achieve a better score in the retest.

Conclusion

The reported CVs of VO2max measurements in five Australian laboratories had a mean of 2.2% (Gore 2000), of which an estimated 2% is biological error, based on the finding of Katch (1982) that biological errors account for 90% of variability. The 8.8% CV in this study is comparatively high and implies a possible biological variability of 7.9%. This high value suggests that there has been a large observable change within the participants between test and retest, and it is possible that a real change due to improved fitness from training has affected the results. The skew in the correlation plot due to one participant’s measurement also supports this view.
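The biological share of each CV follows from applying the 90% proportion attributed to Katch (1982). The sketch below simply retraces that arithmetic; the variable names are illustrative.

```python
biological_share = 0.90  # proportion of variability attributed to biological error (Katch 1982)

cv_reference = 2.2  # mean CV (%) across five Australian laboratories (Gore 2000)
cv_study = 8.8      # CV (%) found in this study

biological_reference = biological_share * cv_reference  # about 2%
biological_study = biological_share * cv_study          # about 7.9%
print(round(biological_reference, 1), round(biological_study, 1))
```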

Hopkins (2000) stated that a real change is represented by 1.5-2.0 times the typical error, or 6.24-8.32 mL.kg-1.min-1 based on this study’s ME of 4.16 mL.kg-1.min-1. This range is considered broad, as world-class male runners have typical VO2max values of 80-90 mL.kg-1.min-1 and would therefore require at least a 7% improvement to show a real change. Furthermore, TEM indicates that a difference of ±7.2 mL.kg-1.min-1 is required to show a real change. This lack of sensitivity to real change is also demonstrated by the Bland-Altman plot, which has a positive mean difference and wide limits of agreement. Despite this, PCC shows a positive relationship between test and retest; however, it is not able to take mean differences into account.
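The 6.24-8.32 range quoted above corresponds to 1.5-2.0 times the method error of 4.16 from Table 3, and the 7% figure to the lower bound expressed against a VO2max of 90 mL.kg-1.min-1. The sketch below retraces that arithmetic; the variable names are illustrative.

```python
method_error = 4.16  # ME from Table 3 (mL.kg-1.min-1)

real_change_low = 1.5 * method_error   # 6.24
real_change_high = 2.0 * method_error  # 8.32

# Smallest detectable change as a percentage of an elite runner's VO2max
elite_vo2max = 90.0  # upper end of the 80-90 mL.kg-1.min-1 range quoted in the text
percent_needed = 100 * real_change_low / elite_vo2max  # about 7%
print(round(real_change_low, 2), round(real_change_high, 2), round(percent_needed, 1))
```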

To conclude, this study is not sensitive enough to detect small changes within individuals that would suggest a real change, owing to its small sample size and the prolonged time between test and retest. It is recommended that this reliability study be repeated with a larger population and with no more than one week between test and retest.
