All content following this page was uploaded by Héctor Cobos-Aguilar on 21 May 2014.
Abstract
Introduction: In order to evaluate complex skills such as research, valid and reliable instruments must be constructed. Purpose: To report the process of construction, validation, and reliability testing of an instrument to assess the critical reading of medical papers. Methodology: An electronic search of published medical papers was performed. After several rounds of review, six research designs were selected: validation of instruments, surveys, case-control studies, diagnostic tests, clinical trials, and cohorts. An abstract of each paper was written in Spanish, emphasizing methodological issues (design, instruments used, sample size, blinded evaluations, bias, selection of statistical methods, and others). From each abstract, several items were constructed and grouped under headings related to the methodological issues and the indicators of critical reading used: interpretation (uncovering hidden aspects), judgment (evaluating the best methods), and proposals (suggesting better methods). Each item had a true-or-false answer. An initial set of 157 items was constructed. Validation: Five experts were invited to evaluate, in two independent rounds, whether the instrument had theoretical, construct, and content validity. The changes the experts suggested were made during these two rounds. At the same time, they answered the final items as true or false; agreement of 4/5 or 5/5 was required for an item to be accepted. A pilot application with students allowed further adjustments. The final version of the instrument had 108 items, 36 for each indicator and 18 for each design; half of the correct answers were true and the other half false. Application: The instrument was answered by three groups with different levels of experience in critical reading: G1 (n = 7), professors of medical specialties; G2 (n = 23), medical interns in an active research course; and G3 (n = 24), medical students without any research course. Each item was answered as "true", "false", or "don't know". Each correct answer added one point and each incorrect answer subtracted one point; "don't know" answers neither added nor subtracted points. The final grades were expressed as group medians. Grading was performed blindly by staff unrelated to the research, through an electronic system created specifically to minimize data-capture errors. The data were analyzed with SPSS version 15, and a random-answer level was determined for each group. Statistics: Intra- and inter-rater agreement were obtained. KR-20 was used to calculate reliability. The Kruskal-Wallis test was used to compare the groups, and Spearman's test was used to relate school averages to global medians. The random level was determined as well. Results: The raters agreed on the validity of the instrument. Overall reliability was 0.75. Inter-rater agreement was 0.82 and intra-rater agreement was 0.80. The global median was 62 for G1, 28 for G2, and 11 for G3. The calculated random level was 17. All results favoring G1 were significant (p <= 0.01) for the global score, the three indicators, and the six designs. The proportion of random-level answers was 0% for G1, 13% for G2, and 83% for G3. Comments: The validity and reliability of the instrument are adequate for evaluating critical reading. The contrast between groups with different experience and the random-level values support its reliability. Instruments used to assess this complex skill should be valid and reliable.
Keywords: validity, reliability, instrument, critical reading.
1 INTRODUCTION
It is currently pressing to use methods that guarantee consistency of observation when variables are sought by means of instruments (scales, questionnaires, etc.) that frequently depend on the designer's experience. Although these usually refer to qualitative indicators [1], their construction depends on the reviewers' mastery of the elements to be evaluated in their professional practice. Due to the need to elaborate observation or measurement instruments such as psychological or psychometric
methodological aspects of the different research study designs presented in published research
papers [25].
In research on this methodological aptitude, every report includes a reference to the construction of the instrument intended to measure the development of critical reading among students. The instruments are reported as validated, and consistency reports are also provided. Nevertheless, these reports do not describe the validation process in detail and hence contribute no prior experience to the construction process, and little is known about the variability of the answers that experts give to the items being explored [26]. In other research areas with more empirical viewpoints, this exercise is common [27].
This study is aimed at identifying inter-observer and intra-observer variability among experts in the
construction of instruments designed to measure critical reading of research papers.
1.1 Objectives
Describe the validation process and consistency of an instrument evaluating critical reading of
research papers. Determine intra and inter-observer agreement among experts designing an
instrument to evaluate critical reading development.
2 METHODS
2.1 Design
Validation and consistency of a measuring instrument.
2.2 Variables
Independent: experience in the validation of instruments evaluating critical reading of factual research
and critical reading of research. Dependent: intra and inter-observer consistency values. The following
hypotheses were proposed: a) inter and intra-observer response consensus among experts evaluating
critical reading of research is above 0.60. Independent variable: experience in the validation of
instruments evaluating critical reading of factual research and critical reading of research. Dependent
variable: results obtained by expert consensus.
2.3 Population
Five experts in critical reading of research who collaborated in the instrument's validation.
study and its design, measuring instruments, the conditions of data application and collection and the
analysis of the information. Proposal: To think of components, elements or theoretical or
methodological procedures that would further strengthen and validate or emphasize novelty, relevance
or pertinence of the critiqued research paper. Based on these three indicators, several items exploring
the papers’ methodology were generated. Grammar rules were strictly followed in order to avoid
confusion when reading the summaries. Items were to be answered as “true” or “false” according to
the expert’s experience. They also had to evaluate the theoretical validity, construction and content of
each summary, assertion and corresponding items, as well as the overall instrument.
From the beginning, we decided to balance the entire instrument with: 108 items, 18 for each study
design and 36 for each indicator. The items considered “false” or “true” were also balanced. In order to
guarantee that number, the instrument initially included 157 items.
2.9 Consistency
The instrument’s internal consistency was determined with Kuder-Richardson’s formula [20]. The
results of three groups with different experience in critical reading of factual research were compared.
These groups were selected and formed as follows: a) a natural group of medical students
beginning the fifth semester of their training and with no previous exposure to critical reading of
research papers (n=24), b) a group of sixth year students (social service) partially exposed to
educational strategies developing critical reading of research papers (n=23) and c) a group of full-time
professors with a medical specialty and with previous experience in critical reading (n=7).
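The KR-20 (Kuder-Richardson 20) coefficient mentioned above reduces to a short formula over the dichotomous item matrix. The following sketch, with entirely hypothetical data (it is not the study's data or software), shows the calculation:

```python
# Sketch of the KR-20 internal-consistency formula.
# Rows are respondents, columns are dichotomous items (1 = correct).

def kr20(scores):
    """scores: list of respondent rows, each a list of 0/1 item scores."""
    n = len(scores)            # number of respondents
    k = len(scores[0])         # number of items
    # proportion of respondents answering each item correctly
    p = [sum(row[j] for row in scores) / n for j in range(k)]
    pq = sum(pi * (1 - pi) for pi in p)
    # population variance of the total scores
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq / var)

# Hypothetical 4-respondent, 5-item example
data = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
]
print(round(kr20(data), 3))  # 0.556
```

A value near 1 indicates that the items behave consistently; the study reports 0.75 for the 108-item instrument.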
The instrument's application was scheduled independently for all three groups. Each item was answered as "true", "false", or "I don't know", and graded by adding one point for each correct answer and subtracting one point for each incorrect answer. An "I don't
know" answer neither subtracted nor added points. An individual unrelated to the study graded the instrument after the first and second applications, as well as after its application to the three groups; the answers were then submitted for analysis with a capture program especially designed to minimize errors.
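The grading rule just described (+1 for a correct answer, -1 for an incorrect one, 0 for "I don't know") can be sketched as follows; the answer key and responses below are hypothetical:

```python
# Minimal sketch of the study's grading rule. Keys and answers are
# hypothetical illustrations, not items from the actual instrument.

def grade(answers, key):
    """answers/key: lists of 'T' or 'F'; answers may also contain 'DK'."""
    score = 0
    for a, k in zip(answers, key):
        if a == "DK":
            continue               # "I don't know": no points either way
        score += 1 if a == k else -1
    return score

key     = ["T", "F", "T", "T", "F"]
answers = ["T", "F", "F", "DK", "F"]
print(grade(answers, key))  # 3 correct, 1 incorrect, 1 DK -> 2
```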
The three groups were compared using medians and ranges for the interpretation, judgment, and proposal indicators, as well as for the included study designs. The Kruskal-Wallis test was used to compare the three groups' medians. The randomness factor of the answers was determined in the three groups in order to classify the results into categories (random, very low, low, intermediate, high, very high) [29].
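The randomness cutoff itself is taken from reference [29]. As a plausibility sketch only (our illustration, not the method of [29]): with 108 items scored +1/-1, a purely random true/false responder has an expected score of 0 and standard deviation √108 ≈ 10.4, so a one-sided 95% cutoff lands near 1.645 × 10.4 ≈ 17, consistent with the random level reported in the Results. A quick Monte Carlo check:

```python
# Simulate purely random answering of 108 true/false items scored
# +1 (correct) / -1 (incorrect) and locate the empirical 95th
# percentile of the resulting scores. This is an illustrative
# consistency check, not the calculation used in reference [29].

import random

random.seed(0)
K = 108                       # number of items in the instrument
scores = []
for _ in range(20000):
    # each random answer is right or wrong with probability 1/2
    scores.append(sum(random.choice((1, -1)) for _ in range(K)))

scores.sort()
cutoff = scores[int(0.95 * len(scores))]   # empirical 95th percentile
print(cutoff)                              # close to 17
```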
3 RESULTS
Of the 157 items initially proposed, agreement in the answers (4/5 or 5/5) was obtained for 106 (68%). The 51 items on which no agreement had been reached were sent for a second validation round, and agreement was reached on 29. The instrument retained the items that fulfilled this criterion in both rounds, 132 of the original total (84%); the rest were discarded.
Of these 132, another 24 were rejected. The elimination criterion was based on the experts' comments on an item's ambiguity or lack of clarity, or because its answer was deemed too easy.
The final instrument included 108 balanced items: 18 for each study design and 36 for each of the interpretation, judgment, and proposal indicators; half of the correct answers were "true" and half were "false".
According to Kuder-Richardson's formula, consistency was 0.75. Inter-evaluator agreement was 0.82 across the five evaluators in both rounds (0.38 for 5/5 and 0.44 for 4/5). Intra-observer agreement ranged from 0.70 in evaluator five to 0.90 in evaluator three; the rest scored above 0.80.
Fig. 1 shows the results of the comparison between the three groups and the significant difference
obtained in the medians.
Table 1. Results of the three contrast groups in the three subcomponents and six study designs.

Group       Overall   Int    Jud    Pro    CC    Coh    DxT    Surv   RCT    Instr
Max score     108      36     36     36     18     18     18     18     18     18
1 (n=7)        62      16     24     25     10     10     10     12     12     12
2 (n=23)       28       8     14     10      6      5      7      4      2      7
3 (n=24)       11       2      5      7      3      0      2      0      1      4
p*            0.01    0.01   0.01   0.01   0.01   0.01   0.01   0.01   0.01   0.01

Int = interpretation; Jud = judgment; Pro = proposal; CC = case-control; Coh = cohort; DxT = diagnostic tests; Surv = surveys; RCT = clinical trials; Instr = instrument validation.
* Kruskal-Wallis. Mann-Whitney U < 0.05 favoring group 1 in all cases. NS (χ²) when comparing the three indicators.
Table 2 shows the asymmetry favoring group 1 when the three groups' results were translated into levels.

Table 2. Levels reached by the three groups after excluding the randomness factor. Percentages are in parentheses.

Level          Range    1 (n=7)     2 (n=23)     3 (n=24)
Very high      >86      -           -            -
High           69-85    2 (0.30)    -            -
Intermediate   52-68    5 (0.70)    1 (0.04)     -
Low            35-51    -           8 (0.35)     -
Very low       18-34    -           11 (0.48)    4 (0.17)
Random         <17      -           3 (0.13)     20 (0.83)
4 COMMENTS
We believe that this study presents the key aspects of methodological rigor required for the
construction, validation and consistency determination of an instrument evaluating critical reading of
factual research papers.
The authors are professors actively involved in educational strategies that promote participation in developing critical reading of research papers; they have been continuously involved in the elaboration of these instruments for the past 15 years and develop new reading guidelines for each course.
The first instruments elaborated to evaluate the development of critical reading included only four study designs; in retrospect, we believe their content validity has increased now that students face research problems requiring different methodological design approaches.
This instrument is balanced for the indicators of the subcomponents of critical reading as well as in the
number of items for each summary. We added facets on the construction and validation of instruments
and surveys since we believe they are necessary to evaluate the students’ progress. “True” and “false”
answers were also balanced in order to prevent an answer bias. This could provide further reliability to
the obtained results.
The validation process conducted by the evaluators also contributed to the instrument's theoretical, construct, and content validity. The first rests on Viniegra's contributions on the critical analysis of experience and the development of complex skills, including critical reading, particularly of factual research papers. Construct and content validity were ensured with the Delphi technique, as required in these cases.
The consistency results obtained permit a more rigorous evaluation of this complex skill.
One of the problems we faced was the selection of the so-called experts, whom we have called evaluators. There are no references on this point; in general, when publications refer to evaluators, they only require experience in the area of interest, although this does not assure the expert's quality. From an educational viewpoint, the complex skill of critical expertise is not obtained through academic degrees but rather through the continuous, refined, reflective, and progressively challenging exercise of research critique (factual research, in this case), closely related to teaching activities and to the development of educational strategies that lead students to acquire knowledge autonomously through reading guidelines that foster critical thinking.
The criteria for evaluator selection in this study were not themselves validated, nor was their consistency determined, but the agreement results are well above the hypothesized value and suggest that they are appropriate indicators.
Another indirect fact favoring the evaluator selection was the high rate of agreement in the first and second rounds, which led to an instrument with 5/5 or 4/5 agreement for all correct and incorrect answers (100%).
However, intra-evaluator agreement (this article's main objective) further confirms this point. We believe that the evaluators' unawareness of the second evaluation of the correct and incorrect answers, as well as the time elapsed between evaluations, increased the study's value. The physical distance between evaluators may hinder collaboration in instrument evaluation; this can be improved using electronic media, but evaluators should remain blinded to the previously reported results when intra-evaluator agreement is under scrutiny.
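Intra-evaluator agreement of this kind is simply the proportion of items given the same answer in both rounds. A minimal sketch with hypothetical answers:

```python
# Raw (proportion) agreement: the consistency measure the study
# reports for intra- and inter-evaluator comparisons. The two
# answer rounds below are hypothetical.

def raw_agreement(round1, round2):
    """Proportion of items given the same answer in both rounds."""
    matches = sum(a == b for a, b in zip(round1, round2))
    return matches / len(round1)

r1 = ["T", "F", "T", "T", "F", "T", "F", "F", "T", "T"]
r2 = ["T", "F", "F", "T", "F", "T", "F", "T", "T", "T"]
print(raw_agreement(r1, r2))  # 8 of 10 answers repeated -> 0.8
```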
Agreement values exceeded the hypothesized threshold (0.60), so the selection criteria do reflect individual experience (0.72 to 0.90), which we consider adequate. As mentioned in the introduction, variability among expert evaluators is infrequently studied, and we consider it indispensable, particularly when the criteria defining an expert are not well established.
On the other hand, we also determined kappa values with confidence intervals, and they were not congruent with the results of our intra-observer agreements. This apparent contradiction has a theoretical basis: the kappa statistic is not the most appropriate here because it is dominated by prevalence, an aspect unrelated to education and learning; we therefore chose raw agreement as the measure for this point.
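The prevalence problem alluded to above is easy to demonstrate: when one answer dominates the marginals, kappa collapses even though raw agreement stays high. A hand-rolled illustration with hypothetical ratings:

```python
# Cohen's kappa versus raw agreement under skewed prevalence.
# The two rating vectors below are hypothetical.

def cohen_kappa(x, y):
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n        # observed agreement
    labels = set(x) | set(y)
    # chance agreement from the marginal frequencies of each label
    pe = sum((x.count(l) / n) * (y.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# 18 of 20 items rated "T" by both raters, plus two disagreements
x = ["T"] * 18 + ["T", "F"]
y = ["T"] * 18 + ["F", "T"]
po = sum(a == b for a, b in zip(x, y)) / len(x)
print(po)                         # raw agreement is 0.9
print(round(cohen_kappa(x, y), 3))  # kappa is -0.053
```

Despite 90% raw agreement, kappa is slightly negative because almost all expected agreement is attributed to the dominant "T" marginal; this is why prevalence-sensitive kappa can mislead when one answer is far more common.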
Table 1 emphasizes this point, since the contrast between the three groups reflects the students' lack of experience in this crucial skill for their future medical practice; it is most developed in the group of professors. The instrument also reveals the authors' progress and suggests that the instruments elaborated will remain permanently complex.
Table 2 confirms the distance between the three groups when the randomness factor is excluded. The results support the consistency values determined with Kuder-Richardson's formula. The different expression of the results in three strata clearly relates to experience in research and supports our validation results, since the instrument brings to light evidence of the development of this complex ability.
More precise studies investigating evaluator consistency will also increase their skills and foster the availability of highly trained personnel who can play a role in these activities, which are not limited to evaluation instruments but may also encompass research protocols in different institutional settings.
REFERENCES
[1] Pope C, Ziebland S, Mays N. Qualitative research in health care. Analyzing qualitative data.
BMJ. 2000;320:114-16.
[2] Bland JM, Altman DG. Validating scales and indexes. BMJ 2002;324:306-7.
[3] Prieto G, Muñiz J. Un modelo para evaluar la calidad de los test en España. Papeles del
psicólogo 2000;67:65-77.
[4] Boynton PM, Greenhalg T. Selecting, designing, and developing your questionnaire. BMJ
2004;328;1312-5.
[5] Merino SC, Lautenschlager GJ. Comparación Estadística de la Confiabilidad Alfa de Cronbach:
Aplicaciones en la Medición Educacional y Psicológica. Revista de Psicología de la Universidad
de Chile. 2003; XII (2):127-136.
[6] Briones G. Validez y confiabilidad des pruebas de medición. En: Métodos y técnicas de
investigación para las ciencias sociales. Ed. Trillás.México, D.F. 2003.
[7] Boynton PM, Greenhalgh T. Hands-on guide to questionnaire research. Selecting, designing,
and developing your questionnaire. BMJ 2004;328:1312–5
[8] Mays N, Pope C. Qualitative research in health care. Assessing quality in qualitative research.
BMJ. 2000; 320:50-2.
[9] Bisquerra R. Métodos de investigación educativa Guía Práctica. Ed. Ceac. Barcelona, España.
1989.
[10] Jones J, Hunter D. Consensus methods for medical and health services research. BMJ
1995;311:376-380.
[11] Landeta, Jon. El método Delphi. Una Técnica de previsión para la incertidumbre. Ariel.
Barcelona y Godet, Michel. Manuel de Prospective Strategique. Dunod. Paris (1999).
[12] Ledesma R. Alpha CI: un programa de cálculo de intervalos de confianza para el coeficiente
alfa de Cronbach. Psico-USF, 2004;9(1):31-37.
[14] Van Rooyen S, Godlee F, Evans S, Black N, Smith R. Effect of open review on quality of
reviews and on reviewers’ recommendations: a randomized trial. BMJ 1999;318:23-27 .
[16] Viniegra VL. Educación y crítica. El proceso de elaboración de conocimiento. México: Paidós;
2002.
[18] Cobos AH, Insfrán SMD, Pérez CP, Elizaldi LNE, Hernández DE, Barrera MJ. Lectura crítica de
investigación en el internado de pregrado en hospitales generales. Rev Med IMSS 2005; 43 (2):
117-124.
[19] Soler HE, Sabido SC, Sainz VL, Mendoza SH, Gil AI, González SR.
Confiabilidad de un instrumento para evaluar la aptitud clínica en residentes de medicina
familiar. Arch Med Fam 2005; 7 (1): 14-17.
[20] Pantoja PM, Barrera M J, Insfrán SMD. Instrumento para evaluar aptitud clínica en
anestesiología. Rev Med IMSS 2003; 41 (1): 15-22.
[21] Rivera CJM, Leyva GFA, Leyva SCA. Desarrollo de la aptitud clínica de médicos internos de
pregrado en anemias carenciales mediante una estrategia educativa promotora de la
participación. Rev Invest Clin 2005; 57 (6): 784-793.
[22] Soler HE, Sabido SC, Sainz VL, Mendoza SH, Gil AI, González SR. Confiabilidad de un
instrumento para evaluar la aptitud clínica en residentes de medicina familiar. Arch Med Fam
2005; 7 (1): 14-17.
[23] Pantoja PM, Barrera M J, Insfrán SMD. Instrumento para evaluar aptitud clínica en
anestesiología. Rev Med IMSS 2003; 41 (1): 15-22.
[24] Rivera CJM, Leyva GFA, Leyva SCA. Desarrollo de la aptitud clínica de médicos internos de
pregrado en anemias carenciales mediante una estrategia educativa promotora de la
participación. Rev Invest Clin 2005; 57 (6): 784-793.
[25] Matus MR, Leyva GF, Viniegra VL. Lectura crítica en estudiantes de enfermería: efectos de una
estrategia educativa. Rev Enferm IMSS. 2002; 10 (2): 67-72.
[27] López CJM, Araiza AR, Rodríguez MR, Murguía MC. Construcción y validación inicial de un
instrumento para medir el estilo de vida en pacientes con diabetes mellitus tipo 2. SPM
2003;45(4):259-268.
[28] Siegel S, Castellan JS. Capítulo 8. Medidas de asociación y sus pruebas de significación. En:
Estadística no paramétrica aplicada a las ciencias de la conducta. 4ª. Ed. Ed. Trillás. México,
D.F. 2005. p: 325-333.
[29] Pérez-Padilla J.R., Viniegra V.L. Método para calcular la distribución de las calificaciones
esperadas por azar en un examen del tipo falso, verdadero, no sé. Rev.Inv.Clin. 1989; 41:375-9
ADDENDUM 1 (FRAGMENT)
Role of the active and passive smoker in lung cancer
Lung cancer may have multiple etiologies. The aim of the study was to identify its relation to passive and active smoking, among other risk factors. 385 patients (cases) with a pathology-confirmed diagnosis of lung cancer were studied in tertiary care hospitals. 898 healthy controls, recruited in the community through multi-stage randomized sampling stratified by age, were also studied. Cases and controls (1:2 ratio) completed a questionnaire applied by two trained interviewers who were unaware of the study's purpose. It included questions on active or passive tobacco exposure, the number of cigarettes smoked per day, and the age at which smoking began (if the subject smoked). Data were captured twice and analyzed with SPSS. Bivariate odds ratios (OR) and 95% confidence intervals (CI) were used for the statistical analysis.
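For the bivariate OR and 95% CI mentioned in the fragment, a common computation is the cross-product ratio with Woolf's logit interval; the 2x2 counts below are hypothetical and are not the study's data:

```python
# Odds ratio from a 2x2 case-control table, with a 95% CI by
# Woolf's (logit) method. All counts are hypothetical.

import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b = exposed/unexposed cases; c, d = exposed/unexposed controls."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log(OR), Woolf's method
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts: 200/185 smokers among cases, 300/598 among controls
or_, lo, hi = odds_ratio_ci(200, 185, 300, 598)
print(round(or_, 2), round(lo, 2), round(hi, 2))  # 2.15 1.69 2.75
```

Since the interval excludes 1, the hypothetical exposure would be associated with the outcome at the 5% level.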