Fitness testing (i.e., testing the components of fitness related to sports performance and injury) helps drive physical preparation strategies to improve performance and mitigate injury risk (McGuigan, 2017). Fitness testing forms a key component of the needs analysis and the assessment of the individual, while an understanding of the physical demands and key performance indicators helps drive fitness test selection (McMahon et al., 2018). As summarised in Figure 1 below, fitness testing plays an integral role in identifying athletes' strengths and weaknesses, which helps inform exercise prescription and training interventions. We then monitor the success of the intervention by retesting the fitness qualities over a period of time, and this process is repeated multiple times throughout the season to try to optimise performance.
In this blog, we will describe and rationalise the importance of reliability and validity during fitness testing data collection, identify methodological factors which can impact reliability and validity during testing, and select suitable tests to evaluate relevant fitness components.
Fitness Testing Rationale
Fitness testing is performed for a plethora of reasons outlined below:
- Create an athlete profile / benchmarking
- Identify strengths and weaknesses
- Inform exercise prescription
- Monitor the effectiveness of training / interventions
- Talent identification
- Injury risk screening
- Return to play (from injury) decisions
- Identify neuromuscular fatigue
To accurately assess performance and fitness characteristics of athletes, we must select tests that are valid, reliable, and sensitive enough to detect “real” changes in performance (Mundy and Clarke, 2018). Test results are only useful if the test actually measures what it is supposed to measure (validity) and if the measure is repeatable (reliability) (McGuigan, 2017).
- Validity – the ability of a test to measure accurately, with minimal error, a specific fitness component. Does the specific test measure what it is supposed to measure?
- Reliability – the ability of a test to yield consistent and stable scores across trials over time. How repeatable is the performance?
For example, in Figure 2 above, a practitioner wants to examine the reliability of an athlete’s mass using some new bathroom scales. The athlete weighs 80 kg as established using the gold standard (hydrostatic weighing), and the practitioner wants to compare this result to the new bathroom scales. In the first scenario (Figure 2, left), the athlete steps on the scales 3 times and registers 3 inconsistent scores (two greater than 80 kg, one less), indicating that the scales are neither reliable nor valid. In the second scenario (Figure 2, middle), the athlete steps on three times and achieves three consistent (reliable) but higher scores (not valid). In some situations this may not be too problematic: if the error (overestimation / underestimation) is consistent, it can be factored in when monitoring longitudinally and comparing data to other devices. Finally, in the optimal scenario (Figure 2, right), the athlete steps on the scales three times and achieves consistent and accurate measures, confirming that the scales are both reliable and valid. Table 1 provides an overview of key terms related to the validity and reliability of measurement tools.
When collecting fitness data within a testing session, we normally perform multiple trials to attain a general overview of an athlete’s performance, known as within-session reliability (McGuigan, 2017); however, this will depend on the fitness quality examined. For example, non-fatiguing assessments, such as the vertical jump, make it easy and feasible to collect multiple trials within a session (e.g., 3 jumps) (Comfort et al., 2018). Conversely, tasks which are highly fatiguing and require exercising to volitional exhaustion would generally involve only one trial within a session, as this is considered safe and appropriate (Comfort et al., 2018). For example, it would be nonsensical to conduct two VO2max tests back to back on the same day.
Alternatively, we can repeat the fitness test over multiple sessions (generally 2–7 days apart) to determine whether the scores are stable and repeatable between sessions. This is known as between-session reliability, and it is considered a stronger form of reliability than within-session reliability (McGuigan, 2017; Mundy and Clarke, 2018). Notably, it is very difficult for an athlete to fully replicate their exact performance between trials or days. This difference in performance is known as variation / noise / typical error / measurement error (McGuigan, 2017; Mundy and Clarke, 2018) and can be attributed to internal (biological variation) and / or external sources of error (e.g., equipment). An example of within- and between-session variation in vertical jump height is illustrated in Figure 3 below.
Typical Error of Measurement
- Provides a direct measure of the amount of error associated with the test.
- Various methods of calculating this error exist, but a simple individual approach is to calculate the coefficient of variation (CV%) (Figure 4):
Coefficient of Variation (CV%)
- Represents the amount of variability or error for the specific variable associated with the test (McGuigan, 2017).
- Standard deviation (SD) is a measure of variability of a set of results around the mean.
- CV% = SD / mean × 100 for the specific metric of interest.
- Generally, we aim for a CV% ≤ 10%, but ideally this error should be as low as possible (McGuigan, 2017).
- A “real” change is considered a change in score > SD or CV%. Ideally, this change will be 1.5-2 × greater than the SD or CV because we can then be 95% certain a “real” change has occurred (McGuigan, 2017).
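The CV% calculation and the "real change" check described above can be sketched in a few lines. This is a minimal illustration using hypothetical jump-height trials; the function names and data are illustrative, not from the source.

```python
# CV% and "real change" check for a set of test trials (hypothetical data).
from statistics import mean, stdev

def cv_percent(trials):
    """Coefficient of variation: (SD / mean) * 100 for the metric of interest."""
    return stdev(trials) / mean(trials) * 100

def is_real_change(baseline_trials, new_score):
    """A "real" change: the new score differs from the baseline mean by more
    than the typical error (CV%, expressed in raw units) of the baseline trials."""
    m = mean(baseline_trials)
    typical_error = cv_percent(baseline_trials) / 100 * m
    return abs(new_score - m) > typical_error

# Three baseline vertical jump trials (cm) for a consistent athlete
jumps = [40.1, 39.8, 40.4]
print(round(cv_percent(jumps), 2))   # low CV% indicates consistent performance
print(is_real_change(jumps, 41.5))   # change exceeds the typical error
print(is_real_change(jumps, 40.2))   # change within the typical error (noise)
```

Note that a stricter criterion, per the text, would require the change to be 1.5–2 × the typical error before claiming 95% certainty of a real change.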
In Figure 4 above, athlete A displays stable, consistent performance across the trials with a CV% of 1.5%, which is considered very low. Going forward, when implementing an intervention, the change in jump height must exceed the CV% (for athlete A, 1.5%) for it to be deemed a “real” or “true” change in performance (McGuigan, 2017; Mundy and Clarke, 2018). Conversely, athlete B displays inconsistent scores across trials, resulting in a larger CV%, which would be considered unacceptable; as stated before, a change in performance must exceed the CV%. When working with elite athletes, the smallest increases can make the difference between winning a medal or not. As such, we aim to reduce the error as much as possible so our testing is sensitive enough to detect “real” and “meaningful” changes in performance (McGuigan, 2017).
Factors Affecting Validity and Reliability
A range of internal and external factors can affect the validity, reliability, and measurement error associated with fitness testing and these are presented in the Figure 5 below (McGuigan, 2017). Importantly, however, practitioners can put provisions in place to control for these factors (McMahon et al., 2018; Comfort et al., 2018).
Every attempt should be made to standardise the above factors and keep them consistent when testing athletes (Figure 5). For best practice, the considerations below should be applied consistently so that any observed changes can be attributed to adaptation, and not to methodological inconsistencies (Van Winckel et al., 2014).
- Time of day
- Duration since last training session/match
- Equipment used
- Surface (an indoor surface is preferable for consistency)
- Standardised warm up preceding testing
- Temperature and humidity
- Player motivation
As sports scientists, we have a duty of care to ensure client welfare and safety (Atkins, 2018). As such, before we undertake any testing with our client, there are pre-testing requirements related to the participant, equipment/environment and test familiarisation which should be adhered to, and these should conform with the professional standards of national governing bodies (Atkins, 2018).
Participant:
- Pre-test preparation – ensure that the athlete has had sufficient nutritional intake, adequate rest, and is well hydrated before commencing testing. Typically, athletes should abstain from high-exertion physical exercise for 48 hours prior to testing.
- Explain the purpose of the testing to the client, along with the associated risks / benefits
- Client completes a PAR-Q which satisfies all relevant criteria
- Obtain relevant medical and lifestyle history information
- Obtain informed consent from the client
Equipment / environment:
- Calibrate equipment (where necessary) and ensure it adheres to manufacturer recommendations
- Organise equipment / test order
- Maintain room temperature (20–22 °C) where possible for indoor environments; be aware of external temperature and weather conditions when testing outside
- Risk assessment to be carried out
- Qualified first aider present at testing, with awareness of relevant health and safety protocols (fire exits, etc.)
Test familiarisation (both participant and experimenter):
- Ensure that adequately trained personnel conduct the testing
- If possible, ensure that the client is familiarised with testing
Fitness testing can be conducted in laboratory or field-based environments, and the type of testing approach will be dictated by the needs analysis, the fitness quality to be examined, the number of athletes to be tested, and accessibility and financial factors related to time constraints and equipment. For example, if working with an individual elite cyclist, laboratory testing will most likely be feasible, allowing greater insight into neuromuscular and cardiorespiratory function. Conversely, when working with a large squad of athletes, with limited time for testing, field-based testing will be more appropriate and feasible to permit mass testing of athletes. However, practitioners may use a combination of laboratory and field-based tests for a holistic overview of an athlete’s fitness profile.
The advantages and disadvantages of laboratory and field-based testing are presented in Table 2 above. Ideally, where possible, the gold standard test should be utilised to obtain accurate and reliable data, but these are normally restricted to laboratory environments and not normally accessible to most practitioners, nor logistically feasible with large squads (McMahon et al., 2018). Where possible, practitioners may decide to use more accessible and readily available field-based test alternatives which have been validated against the “gold standard” (McMahon et al., 2018). For example, it is impractical to conduct individual VO2max tests for a rugby team, but a field-based CRF test such as the 30-15 intermittent fitness test will permit the whole squad to be tested simultaneously, and importantly provides a cheaper, accessible and validated measure of CRF (NSCA, 2016; Laursen and Buchheit, 2019).
Practitioners must also consider specificity when selecting their tests. For instance, testing on sport-specific surfaces and using sport-specific equipment may increase athlete motivation and adherence, and may provide greater insight into sport-specific performance. Generally, however, sport-specific testing is restricted in the amount of information that can be obtained compared with laboratory testing, which utilises more sophisticated equipment. Additionally, the logistics associated with fitness testing must be considered: How many participants? How much time do you have to test and analyse? How much money do you have to spend on equipment? These are all important questions to ask when designing and implementing a testing battery for athletes. It is worth noting, however, that technology is continually developing and becoming more financially accessible. For example, smartphone / tablet applications can be used to examine vertical jump / power capabilities simply by using high-speed video recording via an inexpensive application (e.g., MyJump 2).
Depending on the sport, the national governing body may require you to conduct mandatory fitness testing at certain points of the season, which may dictate some of the testing required for the athletes. For example, the EPPP (Elite Player Performance Plan, 2012) requires all Premier League / categorised academy soccer teams in England to conduct mandatory fitness tests (3 times a year) for national benchmarking across the U9s–U23s, in addition to training load and injury surveillance monitoring.
Generally, testing frequency will depend on a myriad of factors such as the specific sport, the time of the year, access to equipment, athlete buy-in, fixture congestion, and approval from the head coach (to name a few). Normally, fitness testing will take place 3–4 times a season, usually beginning in pre-season to identify strengths and weaknesses that help inform training priorities for the preparation phase (McMahon et al., 2018). If possible, testing should take place twice within a 2–7 day period during the pre-season to examine the between-session reliability and establish the measurement error for each test and athlete. This will help establish what is considered a “real” and “meaningful” change in performance, and identify any learning effects, for a more representative overview of neuromuscular performance (Mundy and Clarke, 2018). Any subsequent testing should be standardised (Figure 5) and ideally follow, where possible, the initial testing procedures to ensure any changes in performance can be attributed to training-induced adaptations or fatigue, rather than to testing or protocol variations. For example, testing should take place at the same time of day to control for circadian rhythm, potentially coinciding with competitive performance times to simulate competition conditions (McMahon et al., 2018).
Next, where possible, testing should take place at the end of the preparation phase to monitor any changes, compare against benchmark data, and assess the athletes’ preparedness as they engage in competitive fixtures (McMahon et al., 2018). Then, testing should ideally take place at some point during the competitive schedule (i.e., mid-season) to monitor the impacts of the competitive season on fitness characteristics. If possible (depending on the sport), this should coincide with a scheduled break in competition (i.e., some sports have winter breaks). Finally, a final testing session is normally conducted prior to the end of the competitive schedule to provide some normative data prior to the off-season, and again to examine the effects of the competitive schedule on fitness characteristics. Please note that this is just a guide, and there are numerous strategies for scheduling testing. For example, in-season, the number of tests may reduce, and only a few tests might be selected and administered to reduce the time spent testing, as performance training will take greater priority. Additionally, some practitioners may wish to conduct some form of testing at the end of every mesocycle, such as an isometric strength or vertical jump assessment, as these are time-efficient and can be easily integrated into resistance training sessions (Comfort et al., 2018). What must be acknowledged is that time used for testing is time that could be spent training (physically or tactically/technically), so it is highly important that the testing performed is valid and reliable, and will meaningfully inform practice to optimise performance and mitigate injury risk (Atkins, 2018).
Continual Testing and Monitoring:
Due to the issues and limitations of structured testing and dedicating testing sessions / days within the annual plan (Comfort et al., 2018), other methods to assess fitness qualities include continual monitoring strategies and the use of “continual testing” assessments. For example, practitioners may assess athletes’ vertical jump performance via a force plate as an indirect measure of neuromuscular fatigue, preparedness, and readiness to train (Comfort et al., 2018). Performing this daily allows practitioners to longitudinally monitor a range of kinetic and kinematic data related to neuromuscular function, and to establish whether increases or decreases in vertical jump performance are observed after specific phases of training. GPS and heart rate (HR) technologies are commonly used during field-based training to monitor external and internal training load (McGuigan, 2017), and a range of metrics such as peak speed, distance covered, and HR responses to drills / training can be used to potentially infer changes in speed and CRF (Comfort et al., 2018), without the need to perform any strict isolated testing.
Equally, in resistance training, intensity and volume-load should be monitored every session. For example, if an athlete’s training load (i.e., working sets performed during training) for a back squat was 5 repetitions at 100 kg in week 1, and by week 6, after a strength mesocycle, they are performing 5 repetitions at 120 kg, it is clearly evident that this athlete has become stronger. Consequently, a specific 1RM assessment is unnecessary to indicate an improvement in strength. Furthermore, bar velocity (known as velocity-based training) can also be examined during resistance training as a method to provide objective feedback and increase movement intent and motivation during strength and power sessions (Weakley et al., 2021). Bar velocity is commonly monitored using a linear position transducer or an accelerometer attached to the bar. Again, if bar velocity is increasing during training with the same corresponding loads after a period of training (e.g., mean concentric velocity has increased from 0.5 to 0.6 m/s with the same load), this improvement is a positive adaptation which has been monitored during training without performing a structured testing session. The approach of “training is testing” and “invisible monitoring” of key metrics during training sessions using wearable or other forms of technology is increasing in accessibility and application in high-performance environments.
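The "invisible monitoring" logic above can be sketched as simple session-to-session comparisons. This is an illustrative example only; the set count, function names, and velocity values are hypothetical, not prescribed by the sources.

```python
# Monitoring strength progress from training data itself, without a
# dedicated testing session (all numbers are hypothetical examples).

def volume_load(sets, reps, load_kg):
    """Volume-load = sets x reps x load (kg)."""
    return sets * reps * load_kg

def velocity_change(v_before_ms, v_after_ms):
    """Change in mean concentric bar velocity (m/s) at the same absolute load."""
    return v_after_ms - v_before_ms

# Back squat working sets: week 1 vs week 6 (assuming 3 working sets)
week1 = volume_load(3, 5, 100)   # 3 x 5 @ 100 kg
week6 = volume_load(3, 5, 120)   # 3 x 5 @ 120 kg
print(week6 - week1)             # positive difference implies greater strength

# Bar velocity at the same load: an increase (e.g., 0.5 -> 0.6 m/s)
# suggests a positive adaptation, monitored invisibly during training
print(round(velocity_change(0.5, 0.6), 2))
```

In practice, the same comparison would be run per athlete and per exercise across a mesocycle, with the velocity data coming from a linear position transducer or accelerometer as described above.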
Testing order is largely influenced by the amount of rest required between tests, and the potential fatiguing effects of the prior testing on the following tests (Comfort et al., 2018). If testing over one-session, the following testing order is recommended (NSCA, 2016; McMahon et al., 2018):
- Baseline information
- Anthropometrics, body composition, flexibility, and range of motion (screening)
- Speed / power / COD speed
- Muscular endurance
- Aerobic capacity / CRF
However, when testing large squads, with limited equipment / time, it may be more appropriate to stagger testing times to avoid a large number of players unnecessarily waiting to be tested. Another approach is a “round robin” where athletes test at different stations concurrently and rotate between tests (i.e., some perform vertical jumps / strength, while some perform sprints) and then the whole group perform the aerobic / fitness test at the end (McMahon et al., 2018). Although this may alter the sequence of testing, logistically this will permit a larger group to be tested in a shorter time, and the sequencing should be consistent for longitudinal assessments.
Table 3 provides a list of tests which are generally used in strength and conditioning / high-performance environments. Please note this is not an exhaustive list; it is a basic overview of some of the most commonly adopted tests used to assess athlete fitness characteristics.
- Testing standardisation is important to ensure collected data is valid and reliable.
- Changes in fitness can only be deemed “real” if they exceed the measurement error associated with the test.
- While structured fitness testing is important, practitioners are encouraged to explore options for “invisible monitoring”.
Atkins, S. (2018). Chapter 1. Ethical and health and safety issues. Performance Assessment in Strength and Conditioning. Routledge.
Atkinson, G., & Nevill, A.M. (1998). Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Medicine, 26(4), 217–238.
Batterham, A.M., & George, K. (2003). Reliability in evidence-based clinical practice: A primer for allied health professionals. Physical Therapy in Sport, 4(3), 122–128.
Comfort, P., Jones, P. A., & Hornsby, W. G. (2018). Chapter 5. Structured testing vs. continual monitoring. Performance Assessment in Strength and Conditioning. Routledge.
Herrington, L. C., Munro, A. G., & Jones, P. A. (2018). Chapter 6. Assessment of factors associated with injury risk. Performance Assessment in Strength and Conditioning.
Hopkins, W.G. (2000). Measures of reliability in sports medicine and science. Sports Medicine, 30(1), 1–15.
Laursen, P., & Buchheit, M. (2019). Science and application of high-intensity interval training. Human Kinetics.
McGuigan, M. (2017). Monitoring training and performance in athletes. Human Kinetics.
McMahon, J. J., Jones, P. A., & Comfort, P. (2018). Chapter 4. Standardisation of testing. Performance Assessment in Strength and Conditioning. Routledge.
Mundy, P. M., & Clarke, N. D. (2018). Chapter 3. Reliability, validity and measurement error. In Performance assessment in strength and conditioning (pp. 23-33). Routledge.
Thomas, J., Nelson, J., & Silverman, S. (2005). Research methods in physical activity (5th edn.). Champaign, IL: Human Kinetics.
Van Winckel, J., McMillian, K., Meert, J.P., Berckmans, B., & Helsen, W. (2014). Fitness testing. In J. Van Winckel, W. Helsen, K. McMillian, D. Tenney, J.P. Meert, & P. Bradley (Eds.), Fitness in Soccer: The Science and Practical Application in Soccer (pp. 123–148). Moveo Ergo Sum/Klein-Glemen.
Weakley, J., Mann, B., Banyard, H., McLaren, S., Scott, T., & Garcia-Ramos, A. (2021). Velocity-based training: From theory to application. Strength & Conditioning Journal, 43(2), 31-49.