Saudi Journal for Health Sciences

Year : 2020  |  Volume : 9  |  Issue : 2  |  Page : 84--87

Relationship of text length of multiple-choice questions on item psychometric properties – A retrospective study

Dareen Khalid Aljehani1, Fawaz Pullishery2, Omer Abdelgadir Elfaki Osman3, Basem Mohamed Abuzenada4
1 Department of Orthodontics, Batterjee Medical College, Jeddah, Saudi Arabia
2 Department of Community Dentistry and Research, Batterjee Medical College, Jeddah, Saudi Arabia
3 Department of Medical Education, Batterjee Medical College, Jeddah, Saudi Arabia
4 Operative Dentistry and Head of the Program, Dentistry Program, Batterjee Medical College, Jeddah, Saudi Arabia

Correspondence Address:
Fawaz Pullishery
Department of Community Dentistry and Research, Batterjee Medical College, P. O. Box 6231, Jeddah 21442
Saudi Arabia


Background: Item-writing flaws introduced while constructing multiple-choice questions (MCQs) can seriously affect the psychometric properties of an assessment. This study aimed to evaluate the relationship between the text length of MCQ items and their difficulty factor (DF), discrimination index (DI), and point biserial (rPB) in a dental program assessment. Materials and Methods: This cross-sectional study included 627 MCQs. The data were taken from the item analysis report generated by ExamSoft software. The questions were divided into long (>100 words), medium (70–100 words), and short (<70 words). DF was categorized as hard (DF <0.3), average (DF = 0.3–0.8), or easy (DF >0.8); DI as negative (DI <0), DI = 0–0.2, or DI >0.2; and point biserial as negative (rPB <0), rPB = 0–0.2, or rPB >0.2. Pearson's Chi-square test was used to assess the relationship between question length and the other variables. Results: Of the 627 MCQs, 31 were long, 56 were medium, and 540 were short. There was a statistically significant association between DF and question length (P < 0.05). No significant relationship was found between question length and DI or point biserial. The median DF was 0.6300 (interquartile range [IQR] 0.41). The median length of the MCQs was 35.0 words (IQR 25.0). Conclusion: The study showed that question length is associated with the DF but not necessarily with the DI or point biserial.

How to cite this article:
Aljehani DK, Pullishery F, Osman OA, Abuzenada BM. Relationship of text length of multiple-choice questions on item psychometric properties – A retrospective study. Saudi J Health Sci 2020;9:84-87

How to cite this URL:
Aljehani DK, Pullishery F, Osman OA, Abuzenada BM. Relationship of text length of multiple-choice questions on item psychometric properties – A retrospective study. Saudi J Health Sci [serial online] 2020 [cited 2020 Sep 27 ];9:84-87
Available from:

Full Text


Introduction

In the field of health professions education, competency assessment has become a cornerstone in evaluating the clinical abilities and overall skills of students. A good assessment method plays a vital role and offers insight into students' approach to learning and their performance. Throughout the world, multiple-choice questions (MCQs) are a common format used to assess students in dentistry and other allied health science disciplines. This format allows the faculty to evaluate a large number of candidates efficiently and to test a wide range of topics.[1],[2]

MCQs, when constructed properly, are one of the best tools to assess cognitive skills and can be used efficiently to discriminate between high and low achievers. A very good MCQ should have few or no item-writing flaws (technical errors); when present, such flaws affect students' performance and thereby reduce the validity and reliability of the assessment process.[3]

An item analysis report (IAR) is an important and easy way to obtain information on the reliability and validity of a test item. In item analysis, the commonly measured properties of an MCQ are the difficulty index (facility value) and either the discrimination index (DI) or the point biserial (rPB). The difficulty index, sometimes denoted as the difficulty factor (DF) or P value, is the percentage of examinees who answered an item correctly, and it ranges from 0% to 100%. The optimal range of difficulty is from 30% to 80% (0.30–0.80). Items with a difficulty index below 30% (<0.30) are considered difficult, and those above 80% (>0.80) are considered easy.[4],[5]
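The difficulty index described above can be illustrated with a minimal Python sketch. The function names and the ten-examinee response vector are hypothetical and are not taken from the study; only the 0.30 and 0.80 cutoffs come from the text.

```python
def difficulty_index(responses):
    """Proportion of examinees who answered the item correctly (0.0 to 1.0)."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(1 for r in responses if r) / len(responses)


def difficulty_category(df):
    """Label an item using the 0.30 and 0.80 cutoffs quoted in the text."""
    if df < 0.30:
        return "hard"
    if df > 0.80:
        return "easy"
    return "moderate"


# Hypothetical item: 1 = correct, 0 = incorrect, for ten examinees.
item_responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
df = difficulty_index(item_responses)
print(df, difficulty_category(df))
```

Here seven of the ten hypothetical examinees answered correctly, so the item falls inside the optimal 0.30–0.80 range.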

DI can be defined as the ability of an item to discriminate between high scorers and low scorers. In short, it measures how well good performers answer a particular item compared with poor performers, and it ranges from −1.00 to +1.00. The point biserial (rPB) is also used as a measure of item discrimination. The only difference between the two is that DI compares the proportion of correct responses for an item between the high and low performers on the test as a whole, whereas rPB is the correlation between students' overall examination scores and their scores on an individual question.[6],[7] Items with a DI or rPB of 0.40 or more are considered “very good,” 0.30–0.39 “reasonably good,” 0.20–0.29 “marginal,” and <0.20 “poor.”[6],[8],[9] Many factors affect the DI and DF, including language, grammar, areas of controversy, the types and number of distracters, and unfocused questions.[3],[10]
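The contrast between DI and rPB drawn above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used by any particular item analysis software: the upper and lower groups are taken as the top and bottom 27% of examinees by total score (a common convention that the text does not specify), the total score includes the item itself, and the example data are hypothetical.

```python
import math


def discrimination_index(item, totals, fraction=0.27):
    """Upper-group minus lower-group proportion correct for one item.

    `fraction` (assumed 0.27 here) sets the size of each comparison group.
    """
    n = len(item)
    k = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: totals[i])  # lowest total first
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item[i] for i in upper) / k
    p_lower = sum(item[i] for i in lower) / k
    return p_upper - p_lower


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item score and the total score."""
    n = len(item)
    mean_x = sum(item) / n
    mean_y = sum(totals) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item, totals))
    var_x = sum((x - mean_x) ** 2 for x in item)
    var_y = sum((y - mean_y) ** 2 for y in totals)
    if var_x == 0 or var_y == 0:
        raise ValueError("undefined when the item or the totals do not vary")
    return cov / math.sqrt(var_x * var_y)


def quality_label(value):
    """Label a DI or rPB value using the cutoffs quoted in the text."""
    if value < 0:
        return "negative"
    if value < 0.20:
        return "poor"
    if value < 0.30:
        return "marginal"
    if value < 0.40:
        return "reasonably good"
    return "very good"


# Hypothetical data for ten examinees: item score (0/1) and total exam score.
item_scores = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
total_scores = [95, 90, 85, 80, 70, 60, 55, 50, 40, 30]

di = discrimination_index(item_scores, total_scores)
rpb = point_biserial(item_scores, total_scores)
print(di, quality_label(di))
print(round(rpb, 2), quality_label(rpb))
```

DI uses only the two extreme groups, whereas rPB uses every examinee's score, which is why the two values for the same item need not coincide.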

Even though Arabic is the first language of the Kingdom of Saudi Arabia, most health professional courses are taught and evaluated in English. English is considered a foreign language, and students start to learn it in the 4th year of primary school, with a total of four sessions per week. Experts are of the opinion that this exposure is not sufficient for students to acquire enough proficiency in the language for higher studies.[11] Thus, students may experience difficulties when attending health professional courses. To our knowledge, no studies have examined the relationship between the text length of MCQs and item psychometric properties. In countries where English is the first language, students may not face this difficulty because they are proficient in the language. However, in Arabic-speaking countries, students may face challenges because English is usually taught in the first or preparatory year of health sciences courses, after which they are thrown into a sea of medical terms and texts.[12],[13],[14] Hence, this study aimed to assess the impact of the text length of MCQs on DF, DI, and point biserial in a final assessment for undergraduate dentistry students conducted in one of the dental colleges in the Kingdom of Saudi Arabia.

Materials and Methods

This study was carried out in the dentistry program of a private dental school as part of the final assessment of the 2018–2019 academic year. Ethical approval and permission were obtained from the Institutional Research and Ethics Committee (BMC-Res-2018-0026), and informed written consent to use the analysis report of the assessment was obtained from the Medical Education department. The assessment included a total of 627 MCQs taken from the final assessments of eleven dental courses; all were of the four-option type. The dental program is a 7-year program comprising one preparatory year, 3 years of preclinical courses, 2 years of in-depth clinical courses, and a 1-year internship. The MCQs were randomly chosen out of 952 and were classified by text length into long (>100 words), medium (70–100 words), and short (<70 words). We analyzed the relationship of MCQ length with the level of difficulty (DF) and with the power of discrimination as measured by DI and point biserial (rPB).
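The length classification described above can be sketched in Python. How words were counted in the study is not specified beyond the use of a spreadsheet, so counting whitespace-separated tokens is an assumption of this sketch, and the function names and sample question are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def length_category(n_words):
    """Apply the cutoffs from the text: >100 long, 70-100 medium, <70 short."""
    if n_words > 100:
        return "long"
    if n_words >= 70:
        return "medium"
    return "short"


# Hypothetical question text, used only to show the two steps together.
question = "Which of the following materials is most suitable for a Class II restoration?"
n = word_count(question)
print(n, length_category(n))
```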

The IAR was obtained from reports generated by ExamSoft software (ExamSoft Worldwide, Inc., USA). The IAR included three properties, namely DF, DI, and rPB, and the text length of each MCQ was calculated using a Microsoft Excel worksheet. MCQs were categorized into easy, moderate, and hard questions based on the difficulty index. The DI and point biserial were categorized into very good, reasonably good, marginal, poor, and negative. Data were analyzed using SPSS version 23 (IBM SPSS Statistics for Windows, Version 23.0. Armonk, NY: IBM Corp.), and Pearson's Chi-square test was used to assess the relationship of text length with DF, DI, and rPB. P < 0.05 was considered statistically significant.


Results

The median difficulty index was 0.630 (interquartile range [IQR] 0.41), the median DI was 0.2500 (IQR 0.27), and the median point biserial was 0.2700 (IQR 0.23). The median text length was 35 words (IQR 25). The longest MCQ had 165 words and the shortest had 12 [Table 1].{Table 1}

Of the 627 MCQs, 31 (4.9%) were long, 56 (8.9%) were medium, and 540 (86.1%) were short. When the MCQs were assessed for their difficulty index, 167 (26.6%) were easy, 348 (55.5%) were moderate, and 112 (17.8%) were hard. Of the 540 short questions, 100 (18.5%) were hard, 299 (55.4%) were moderate, and 141 (26.1%) were easy. Among the 56 medium-length MCQs, 10 (17.8%) were hard, 25 (44.6%) were moderate, and 21 (37.5%) were easy. Of the 31 long MCQs, only 2 (6.4%) were hard, 24 (77.4%) were moderate, and 5 (16.1%) were easy. The relationship between text length and difficulty index was statistically significant (P = 0.039; [Table 2]).{Table 2}
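As an illustrative check, Pearson's Chi-square test can be recomputed in plain Python from the counts quoted in the paragraph above. This is a sketch, not the SPSS procedure used in the study: the 3 × 3 table is assembled here from the in-text counts, and the upper-tail probability uses the closed-form expression that holds only for an even number of degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of the chi-square distribution (even dof only)."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k  # builds (x/2)^k / k!
        total += term
    return math.exp(-half) * total


# Rows: short, medium, long. Columns: hard, moderate, easy (counts from the text).
counts = [
    [100, 299, 141],
    [10, 25, 21],
    [2, 24, 5],
]

chi2, dof = chi_square_independence(counts)
p_value = chi2_sf_even_dof(chi2, dof)
print(round(chi2, 2), dof, round(p_value, 3))
```

With these counts the table has 4 degrees of freedom, and the computed probability comes out close to the reported P = 0.039.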

When the MCQs were classified by DI, 142 (22.6%) were very good, 101 (16.1%) were reasonably good, 120 (19.1%) were marginal, and 175 (27.9%) were poor; a further 89 (14.2%) discriminated negatively. The association between text length and DI was not statistically significant (P = 0.216; [Table 3]). The relationship of text length with point biserial was also not significant (P = 0.739; [Table 3]).{Table 3}


Discussion

This study was a pilot project conducted on MCQs used in the summative examinations of eleven clinical dental subjects. To the best of our knowledge, no published research on this topic has come from similar local or regional institutes, and we identified an obvious scarcity of related data in the literature. A total of 627 MCQ items were classified into three categories of item length and three levels of item difficulty. The relationship between text length and difficulty index was statistically significant. This showed that long questions are associated with increasing item difficulty compared with questions of short and moderate length. This finding could be explained on the basis that items with more text require more reading comprehension, so the assessment may deviate from its actual aim toward testing language skills. This sort of added difficulty should be considered a type of construct-irrelevant variance (CIV).[15] Evidence shows that poorly constructed, low-quality questions can cause CIV.[16] Tests developed locally by teachers, as in this study, are considered more vulnerable to CIV.[17] In addition, our students are nonnative English speakers. Studies of students with limited English proficiency have shown that their test results were profoundly affected by their limited vocabulary.[18],[19] However, this effect of reading comprehension is not limited to students with limited English proficiency.[17] The ultimate effect of such an increase in item difficulty due to long text is a threat to the validity of the scores and of the decisions about students' mastery that are based on them. This is because the students' ability to answer a question correctly did not depend only on their level of learning.[20] Thus, the test measured an additional ability that it was not intended to measure. To reduce the risk to validity, eliminating or controlling CIV is essential. This could be achieved through faculty training that addresses this particular factor.[17]

The analysis of item text length against DI and rPB revealed no significant relationship with either. It is relevant to mention that question length and reading comprehension might not be the only sources of CIV in this study. Other sources of CIV, including anxiety and test administration conditions, could have affected the long questions and contributed to the added difficulty.[15] It should also be acknowledged that most of the students were nonnative speakers of the examination language, and proficiency in the language was not matched among the examinees, which may introduce confounding bias. Furthermore, an important limitation of the current study is that the questions came from clinical dental courses, and therefore the results may not generalize to basic science courses, other medical disciplines, or other institutes. However, the study confirmed what had been found in a similar previous study.[21] Further studies are recommended across more health sciences programs, with larger samples and preferably at a multicenter level.


Conclusion

Item-writing flaws (IWFs) in test items have already been shown to have a potential impact on psychometric properties. In our study, text length had a significant association with the difficulty index but not with the DI or point biserial. Whether text length should be considered an IWF when constructing test items (especially MCQs) in English for examinees whose first language is not English is not yet clear. There is a need for wider study in this area, which is also a topic of discussion among experts in dental education.


Acknowledgment

The authors would like to express their gratitude to Dr. Osama Kensara, the dean of Batterjee Medical College, and to the Medical Education department for their support of this research.

Financial support and sponsorship


Conflicts of interest

There are no conflicts of interest.


References

1. Downing SM. Assessment of knowledge with written test forms. In: Norman GR, van der Vleuten C, Newble DI, editors. International Handbook of Research in Medical Education. Dordrecht: Kluwer Academic Publishers; 2002. p. 647-72.
2. McCoubrie P. Improving the fairness of multiple-choice questions: A literature review. Med Teach 2004;26:709-12.
3. Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ 2008;42:198-206.
4. Miller MD, Linn RL, Gronlund NE, editors. Measurement and Assessment in Teaching. 10th ed. Upper Saddle River, NJ: Prentice Hall; 2009.
5. Ebel RL, Frisbie DA. Essentials of Educational Measurement. 5th ed. Englewood Cliffs, New Jersey: Prentice-Hall Inc.; 1991.
6. Engelhardt PV. An introduction to classical test theory as applied to conceptual multiple-choice tests. In: Henderson C, Harper KA, editors. Getting Started in PER. Vol. 2. College Park: American Association of Physics Teachers. Reviews in PER; 2009.
7. De Champlain AF. A primer on classical test theory and item response theory for assessments in medical education. Med Educ 2010;44:109-17.
8. Rahim AF. What those Number Mean? 1st ed. Kubang Kerian: KKMED; 2010. Available from: [Last accessed on 2019 May 25].
9. Odukoya JA, Adekeye O, Igbinoba AO, Afolabi A. Item analysis of university-wide multiple choice objective examinations: The experience of a Nigerian private university. Qual Quant 2018;52:983-97.
10. Dufresne RJ, Leonard WJ, Gerace WJ. Making sense of student's answers to multiple-choice questions. Phys Teach 2002;40:174-80.
11. Alhmadi NS. English speaking learning barriers in Saudi Arabia: A case study of Tibah University. Arab World Engl J 2014;5:38-53.
12. Rass RA. Challenges face Arab students in writing well-developed paragraphs in English. Engl Lang Teach 2015;8:49.
13. Malcolm D. Reading strategy awareness of Arabic-speaking medical students studying in English. System 2009;37:640-51.
14. Mourtaga KR. Some reading problems of Arab EFL students. J Al-Aqsa Univer 2006;10:75-91.
15. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
16. Ware J, Kattan TE, Siddiqui I, Mohammed AM. The perfect MCQ exam. J Health Spec 2014;2:94-9.
17. Downing SM. Threats to the validity of locally developed multiple-choice tests in medical education: Construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ Theory Pract 2002;7:235-41.
18. Abedi J, Lord C, Hofstetter C, Baker E. Impact of accommodation strategies on English language learners' test performance. Educ Meas 2000;19:16-26.
19. Fitzgerald J. English-as-a-second-language learners' cognitive reading processes: A review of research in the United States. Rev Educ Res 1995;65:145-90.
20. Premadasa IG. A reappraisal of the use of multiple choice questions. Med Teach 1993;15:237-42.
21. Loudon C, Macias-Muñoz A. Item statistics derived from three-option versions of multiple-choice questions are usually as robust as four- or five-option versions: Implications for exam design. Adv Physiol Educ 2018;42:565-75.