Last updated: September 28, 2024
I. Background
The scarcity of objective, standardized metrics for differentiating applicants has been a long-standing challenge for programs in graduate medical education. The transition of the United States Medical Licensing Exam (USMLE) Step 1 and the Comprehensive Osteopathic Medical Licensing Exam (COMLEX) Level 1 to pass/fail scoring in 2022 compounded this challenge.[1],[2] This change made it more difficult for program directors to assess and compare applicants’ academic preparation on a uniform scale, leaving them with limited alternatives such as USMLE Step 2 CK scores, COMLEX Level 2-CE scores, and/or the perceived prestige or ranking of the medical school. This has led to concerns about fairness and consistency in the residency selection process.[3]
Standardized letters of evaluation (SLOEs), which have gained favor in various specialties, function as additional tools to assess academic and clinical performance in a more uniform manner. However, the information and standards provided in SLOEs are variable, and objective metrics therefore remain scarce.
Historically, the Medical Student Performance Evaluation (MSPE) or “Dean’s Letter” has functioned as a “Summary letter of evaluation intended to provide residency program directors an honest and objective summary of a student’s salient experiences, attributes, and academic performance.”[4] The Association of American Medical Colleges (AAMC) Group on Student Affairs (GSA) Committee on Student Affairs (COSA) MSPE Effective Practices Working Group last released major recommendations in 2016, and then provided updated guidance in May 2022 following disruptions in medical education due to COVID-19.
While the MSPE has brought some standardization to the reporting of student academic performance, its format and content remain inconsistent across medical schools. What is defined as a required “core clerkship,” one that provides medical students a supervised training period with hands-on experience in patient care, also differs by medical school and specialty discipline.[5] Additional variability includes non-standard grading scales (categorical, letter, numerical, etc.) and grading distributions within and across medical schools. This makes grades difficult to understand and compare, requiring manual intervention that increases review time for program directors and other evaluators. It may also lead to overreliance on letters of recommendation or other information provided by medical schools (or letter writers) that are well known to reviewers, increasing the potential for bias and disadvantaging applicants from lesser-known or newer medical schools.[6]
To address these concerns and to provide more objective comparison, Thalamus, in collaboration with the AAMC, developed a novel transcript normalization tool. This feature uses a large language model (LLM) to provide a standardized method for evaluating medical school grades, aiding residency programs in making more objective, data-driven decisions and promoting equity in the transition to residency.
Key features of this novel transcript normalization tool include:
- Analysis of grades without manual review of a transcript and/or MSPE.
- Standardization of course naming and distributions across medical schools.
- Percentile rank of applicant core clerkship grades for comparison within/across schools.
II. Introduction
This document outlines the methodology behind the development of a novel transcript normalization tool within Thalamus’s Cortex application screening and review platform. The feature was developed as part of an innovation grant made possible through the strategic collaboration between Thalamus and the AAMC. The tool aims to standardize and normalize medical school transcripts and core clerkship grades to facilitate more efficient and equitable application screening and selection for residency programs. It was created through a year-long grassroots effort involving data scientists, AI/ML experts, medical education researchers, and technologists. A proof of concept was built and presented in November 2023 at the AAMC Learn Serve Lead (LSL) Conference in Seattle, WA. The tool has since been iterated on through discussions with, and insights provided by, specialty society and residency program leadership at various academic conferences, forums, and webinars.
III. Objectives
The primary objective of the transcript normalization tool is to standardize academic information across medical schools by converting unstructured medical school transcripts into a structured format, thereby increasing consistency in evaluating applicants from diverse educational backgrounds, medical school types, geographies, and practice environments. By doing so, the tool aims to streamline the residency application process, reduce bias, and enhance the fairness of application evaluation.
This document was created to provide programs and applicants with a transparent view into how the feature works, consistent with Thalamus’s commitment to product inclusivity, accessibility and fairness. As a mission-driven Public Benefit Corporation (PBC), Thalamus continues to commit to efforts like these to promote its public benefit purpose, “Of ensuring greater access to affordable, high-quality medical education and training, addressing systemic inequities in the physician workforce, and delivering better healthcare outcomes for patients and society.”[7]
IV. Methods
1. Data Collection and Analysis
a. Transcript Selection
To build a comprehensive and inclusive normalization tool, Thalamus identified the entire population of medical school transcripts and grades for residency applicants to the Electronic Residency Application Service (ERAS) 2024 (September 2023 – March 2024) as part of its research and data collaboration with the AAMC. This population was narrowed to a subset comprising all US MD (allopathic) and DO (osteopathic) granting medical schools, as well as the international medical schools with the largest number of applicants participating in the US residency match process. Ten applications were analyzed from each medical school for the initial data set. These institutions were selected for the prototype because they historically represent the highest-frequency medical schools in ERAS, accounting for the majority of applicants; each additional medical school beyond this set would add only a marginal percentage of applicants.
b. Manual Review and Mapping of Core Clerkships
i. Determination of Core Clerkships
A team of Thalamus subject matter experts (SMEs) manually reviewed a wide array of transcripts to identify and map the most common core clerkships across medical schools. This manual review was essential to understanding the variations in transcript formatting and nomenclature across each institution.
A “core clerkship” represents a standardized set of clinical experiences that are central to medical education and residency training readiness. Thalamus SMEs determined the following list to be the most inclusive and equitable given frequency and known importance in medical education. The percentage of medical schools in the sample that include each core clerkship is presented in parentheses.
Cortex Defined Core Clerkships:
- Internal Medicine (99.5%)
- Surgery (99.5%)
- OBGYN (99.5%)
- Pediatrics (99.5%)
- Neurology (56.7%)
- Psychiatry (99.1%)
- Family Medicine (86.0%)
Other core clerkships were observed in the analysis but were excluded due to low frequency (i.e., <20%) across the medical schools in the sample. These included Ambulatory and/or Primary Care, Longitudinal Primary Care, Osteopathic Manipulative Medicine (OMM), Anesthesiology, Emergency Medicine, Rural Medicine, Cardiology, Geriatrics, Radiology, Addiction Medicine, Hospice and Palliative Care, Ophthalmology, Clinical Bioethics, Population Health, Critical Care, Orthopaedic Surgery, Pathology, Otolaryngology, Medical Humanities, and Dermatology.
ii. Mapping of Core Clerkships
The analysis of core clerkships resulted in a 1:1 mapping relationship between transcript grade, course name, and course code, except under the following conditions (illustrated in the sketch after this list):
- Some osteopathic and international medical schools have multiple core clerkships within a specialty/discipline (e.g., Internal Medicine I, Internal Medicine II, Surgery I, Surgery II). In this case, each clerkship grade is shown consecutively, concatenated with a comma (e.g., “Internal Medicine H,HP”).
- Some allopathic, osteopathic, and international medical schools “cluster” multiple core clerkships into a single “clustered clerkship” (e.g., Neurology, Psychiatry, and Family Medicine are presented as a cluster with a single grade). In this case, the single grade was mapped to each corresponding Cortex Defined Core Clerkship.
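To make these two exception cases concrete, the following is a minimal sketch in Python. The data structures and function names are hypothetical and illustrative only; they are not Cortex’s internal implementation.

```python
# Hypothetical illustration of the two non-1:1 mapping cases: multi-part
# clerkships concatenated with a comma, and clustered clerkships fanned out
# to each Cortex Defined Core Clerkship they cover.

CORTEX_CLERKSHIPS = [
    "Internal Medicine", "Surgery", "OBGYN", "Pediatrics",
    "Neurology", "Psychiatry", "Family Medicine",
]

def map_transcript_rows(rows):
    """rows: parsed transcript entries; each row lists the Cortex
    clerkship(s) it maps to (clustered clerkships list several).
    Returns {cortex_clerkship: grade_string}."""
    grades = {c: [] for c in CORTEX_CLERKSHIPS}
    for row in rows:
        for target in row["maps_to"]:
            grades[target].append(row["grade"])
    # Multi-part clerkships (e.g., Internal Medicine I/II) concatenate: "H,HP"
    return {c: ",".join(g) for c, g in grades.items() if g}

rows = [
    {"course": "Internal Medicine I", "grade": "H", "maps_to": ["Internal Medicine"]},
    {"course": "Internal Medicine II", "grade": "HP", "maps_to": ["Internal Medicine"]},
    # A clustered Neurology/Psychiatry/Family Medicine clerkship with one grade:
    {"course": "Neuro/Psych/FM", "grade": "P",
     "maps_to": ["Neurology", "Psychiatry", "Family Medicine"]},
]
print(map_transcript_rows(rows))
# {'Internal Medicine': 'H,HP', 'Neurology': 'P', 'Psychiatry': 'P', 'Family Medicine': 'P'}
```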
iii. Grading Scale Determination
Cortex supports the most common grading scales used by medical schools. These include several permutations of categorical (e.g., H/HP/P/F, H/P/F, H/NH/P/F), letter (e.g., A/B/C/D/F, A/B/C/F), and numerical grades (100/99/98/97…). Because some schools use numerical grades that are then converted into a categorical or letter grade, the final grading scale was confirmed by the official grading information listed in the medical school transcript or MSPE, and/or through broader manual transcript review.
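As an illustration only (the scale registry and matching logic below are assumptions, not Cortex’s implementation), supported grading scales can be represented as ordered lists so that the grades observed on a school’s transcripts can be checked against each known scale:

```python
# Hypothetical sketch: supported grading scales as ordered lists (best to
# worst). Ambiguous matches (e.g., H/P/F grades also fit within H/HP/P/F)
# are resolved against the official grading key on the transcript or MSPE,
# as described above.
GRADING_SCALES = {
    "H/HP/P/F": ["H", "HP", "P", "F"],
    "H/P/F": ["H", "P", "F"],
    "H/NH/P/F": ["H", "NH", "P", "F"],
    "A/B/C/D/F": ["A", "B", "C", "D", "F"],
}

def candidate_scales(grades_seen):
    """Return every supported scale whose values cover all observed grades."""
    return [name for name, values in GRADING_SCALES.items()
            if set(grades_seen) <= set(values)]

print(candidate_scales({"H", "NH", "P"}))  # ['H/NH/P/F'] -- unambiguous
print(candidate_scales({"H", "P", "F"}))   # ['H/HP/P/F', 'H/P/F', 'H/NH/P/F']
```

The second call shows why confirmation against the official grading key is necessary: an observed set of grades can be consistent with several scales.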
2. Optical Character Recognition (OCR) and Initial Data Processing
a. OCR Implementation
All medical school transcript PDFs were run through Optical Character Recognition (OCR) technology to convert unstructured transcript data into a text representation. The OCR process was carefully configured and refined to maximize accuracy in recognizing text across a diverse array of transcript formats, fonts, and layouts.
- OCR Model: The OCR technology was iteratively refined to reduce variability and error, enhancing its ability to accurately interpret the nuances of transcript data across different medical schools. Text was collected directly from all text-based transcripts; for image-based transcripts, Tesseract was used to extract text from the page images (a sketch of both extraction paths follows this list).
- Quality Control: Each iteration of OCR output was subjected to a rigorous quality control process, identifying and correcting misreads, and refining the model’s accuracy through manual human intervention and review.
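The document confirms only that Tesseract was used for image-based transcripts. The sketch below shows one plausible arrangement of the two extraction paths, using pypdf and pdf2image as assumed stand-ins for the text-layer extraction and rasterization steps:

```python
from pdf2image import convert_from_path  # renders PDF pages to PIL images
from pypdf import PdfReader              # reads an embedded (text-based) text layer
import pytesseract                       # Python wrapper around the Tesseract OCR engine

def extract_text(pdf_path: str) -> str:
    """Prefer the PDF's embedded text layer; fall back to OCR when absent."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if text.strip():                      # text-based transcript
        return text
    images = convert_from_path(pdf_path)  # image-based transcript: rasterize pages
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```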
b. Text Representation Output
The OCR process resulted in a machine-readable text representation for each transcript. This structured text served as the foundation for further data processing, analysis and normalization.
3. Machine Learning and Large Language Model (LLM) Processing
a. LLM Utilization
The machine-readable text representations were processed using an LLM, specifically the GPT-4o-mini model hosted on Microsoft Azure.[8] This solution was selected because Thalamus uses Microsoft Azure for cloud hosting, which ensures overall data and model security. Through Thalamus’s contractual relationship with Microsoft, neither the input data nor the trained model is used publicly or to train any other GPT solution outside of Thalamus. The solution was fully vetted by Thalamus’s data security and compliance teams. The LLM was trained to recognize and categorize key components of medical transcripts, such as course names, clerkship titles, grades, and other pertinent information (a sketch of this extraction step follows the list below).
b. Model Training and Iterative Refinement
- Model Training: The model was initially trained on a dataset derived from the manual review and mapping process, ensuring it was grounded in a realistic understanding of the data.
- Iterative Refinement: The model underwent multiple iterations, each time refining its accuracy and reliability based on feedback from human reviewers.
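The sketch below illustrates what this extraction step could look like against the Azure OpenAI chat completions API. The prompt, deployment name, and output schema are hypothetical; the document confirms only that GPT-4o-mini hosted on Microsoft Azure was used.

```python
import json
from openai import AzureOpenAI  # Azure client from the official OpenAI Python SDK

client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",  # placeholder endpoint
    api_key="<api-key>",                                   # placeholder credential
    api_version="2024-06-01",
)

# Hypothetical extraction prompt; the actual Cortex prompt is not published.
SYSTEM_PROMPT = (
    "Extract every course from the following medical school transcript text "
    'as a JSON array: [{"course_name": str, "course_code": str, "grade": str}]. '
    "Return only JSON."
)

def structure_transcript(ocr_text: str) -> list:
    """Send OCR text to the model and parse its JSON answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Azure deployment name (placeholder)
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ocr_text},
        ],
        temperature=0,  # favor deterministic extraction
    )
    return json.loads(response.choices[0].message.content)
```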
4. Human Review and Model Refinement
a. Manual Review of Model Outputs
Following the LLM processing, “candidate-structured transcripts” were generated and then meticulously reviewed by multiple human reviewers. This step was crucial for:
- Identifying Inaccuracies: Reviewers corrected any inaccuracies or inconsistencies in the LLM’s outputs, ensuring the highest level of data fidelity.
- Providing Feedback for Retraining: Corrections and observations from the human review process were used to further train and refine the model, enhancing its accuracy and adaptability to various transcript formats.
5. Quality Assurance and Model Validation
After the manual review and retraining phases, additional quality assurance measures were implemented to ensure the final structured transcripts were accurate and reliable, including:
- Cross-Verification: Structured transcripts were cross-verified against the original unstructured versions to ensure all pertinent information was captured correctly.
- Random Sampling and Validation: Randomly selected transcripts were reviewed in detail to ensure consistency and accuracy, and were filtered through a standardized rule set. This maximized medical school transcript coverage while ensuring that any grade reported to program users in Cortex would be 100% accurate under this standardized rule set. A stratified random sample of medical schools was used so that each medical school was represented as equally as possible in the dataset (see the sketch below).
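A minimal sketch of the stratified sampling step, with hypothetical field names and parameters (the document does not specify the per-school sample size or record format):

```python
import random
from collections import defaultdict

def stratified_sample(transcripts, per_school, seed=0):
    """transcripts: records like {"school": ..., "id": ...} (field names
    assumed). Draws up to `per_school` random transcripts from each school
    so every school is represented as equally as possible."""
    by_school = defaultdict(list)
    for t in transcripts:
        by_school[t["school"]].append(t)
    rng = random.Random(seed)  # fixed seed for a reproducible validation set
    sample = []
    for group in by_school.values():
        sample.extend(rng.sample(group, min(per_school, len(group))))
    return sample
```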
6. Model Projection onto Future Transcripts for ERAS 2025 (September 2024)
To ensure the robustness and reliability of the transcript normalization tool across academic years, and to continuously validate performance on new datasets, the model was used to analyze all residency applicant transcripts received to date for ERAS 2025. This was accomplished through a daily iterative process covering all transcripts received by season opening at 9am ET on 9/25/2024. Transcripts received after that date/time were incorporated on a continual basis, every hour, following receipt of the transcript from ERAS.
This projection served three main purposes:
- Accuracy Determination: Because this transcript normalization tool is first being used for the ERAS 2025 residency application cycle, applying the model to the latest versions of medical school transcripts allowed us to assess performance and accuracy in real time on newly formatted or updated transcripts. This process involved additional cross-verification, random sampling, and validation.
- Continuous Improvement: The insights gained from this validation step were used to further refine and retrain the model, while addressing any new discrepancies or errors and incorporating updated transcript formats or grading standards that may have emerged. Program users may also submit error corrections identified while using this tool through the process discussed at the end of this document.
- Create Grading Profiles for Each Applicant: For every applicant whose transcript was analyzed for ERAS 2025, a grading profile was created that includes grades for each of the 7 Cortex Defined Core Clerkships (where applicable/available). Once determined, the grading profile for each applicant is final for the season. Each known and verified applicant grade for each clerkship is presented in the summary view for each residency program using Cortex to screen and review that applicant. These grades were also included in the grading distribution calculations described in the next section.
7. Analysis and Determination of Grade Distributions
Understanding and normalizing the grade distributions for each core clerkship is essential for providing residency programs with a comprehensive and comparable view of applicant performance within and across medical schools. This analysis was conducted across several dimensions, based on the core clerkship grade profiles by applicant (by medical school) described above.
Each applicant’s core clerkship grade was included in determining the grade distribution for that core clerkship at their medical school (of graduation) or affiliated training site. For applicants who attended more than one medical school, only the transcript from the school of graduation was included in the analysis.
As the tool is further developed, we will also analyze grading trends over multiple academic years to detect any shifts or changes in grading practices. This longitudinal analysis will assist with further model refinement and also assist residency (and fellowship) programs in understanding whether changes in student performance are due to actual differences in competencies or evolving grading standards.
8. Grade Displays: Parameters, Interpretation and Minimum Values
For each core clerkship, a student’s grade will display in Cortex as long as 1) the student’s medical school has a defined and mapped Cortex Defined Core Clerkship, and 2) Cortex was able to read and determine the grade. For clarity, the frequency of parsed core clerkships in the test sample is assumed to be random and representative of the broader population; the accuracy observed in the test sample is therefore assumed to hold for the population.
In the event that these conditions are not met, and Cortex is unsure of or unable to assess the grade with the minimal defined level of accuracy, a link showing “View Transcript” will be displayed in place of the grade. Selecting this link will navigate the reviewer to the applicant’s raw transcript PDF for further review.
For each of these analyses, a minimum threshold of applicant grades per medical school per core clerkship per academic year was determined to be required to display an applicant’s grade percentile and grade distribution graph in Cortex. This was determined by Thalamus’s research and development team as the minimum threshold to ensure statistical significance and meaningful insights based on the grading distribution.
Additionally, as long as the minimum threshold (above) is reached, a 1) grade percentile and 2) grade distribution graph will also display as follows:
- Interpreting the grade percentile ranks: For each core clerkship grade, Cortex will display a percentile rank next to each student’s grade, denoting the percent of students who received the same grade or lower at that medical school for that academic year.
- This percentile is based on the transcripts of applying students received by ERAS. The percentile is determined using the mathematical midpoint of each binned grade category (see the sketch after this list). For example, if 80% of students passed a course and 20% failed, the percentile ranks of passing and failing grades would be 60 and 10, respectively. This may differ from the graphs presented in the corresponding medical school’s MSPE for that academic year, as schools have different policies on what data is presented, and varying numbers of their students may apply through other application services or forgo the match entirely.
- For medical schools that are entirely pass/fail, a grade of pass corresponds to the 50th percentile (assuming 100% of students received a grade of pass). Other similar statistical variations/deviations for particular medical schools will be surfaced in corresponding tooltips within the Cortex product in a future update.
- Interpreting the Graphs: Each graph will display the core clerkship name and medical school name in the title. The x-axis will include the grading scale for that particular medical school (for that academic year) and the y-axis denotes the number of students in that academic year at that medical school receiving that grade in the specified core clerkship (and that applied via ERAS for that application cycle). The grade that the corresponding applicant received will be highlighted in a teal box on the graph that is labeled “applicant.”
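The midpoint percentile rule described above can be expressed compactly; the following sketch (illustrative only, not Cortex’s implementation) reproduces the worked pass/fail example:

```python
def midpoint_percentiles(counts):
    """counts: {grade: number_of_students}, ordered worst to best.
    Returns {grade: percentile}, where each grade's percentile is the
    cumulative share of students below it plus half the share receiving it."""
    total = sum(counts.values())
    percentiles, below = {}, 0.0
    for grade, n in counts.items():
        share = 100.0 * n / total
        percentiles[grade] = below + share / 2  # midpoint of the bin
        below += share
    return percentiles

# The worked example above: 20% failed, 80% passed.
print(midpoint_percentiles({"Fail": 20, "Pass": 80}))
# {'Fail': 10.0, 'Pass': 60.0}
```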
Program directors and other evaluators can use these visualizations to compare applicants’ performance relative to their peers within the same school and against students from other institutions.
Note: If the minimum threshold is not met, no percentile or grade distributions will display.
By implementing these analyses and continuously validating the tool’s performance, Thalamus Cortex aims to provide a reliable and objective framework for residency selection, helping program directors and evaluators make informed decisions based on a comprehensive understanding of applicant performance.
V. Limitations
While the transcript normalization tool offers a novel, innovative, and more objective approach to evaluating residency applicants, there are inherent limitations in these methods, predominantly due to the vast variability in medical school grading practices. Users are advised to treat this tool as a guide for comparison while still considering the entirety of every application to facilitate holistic review. Holistic review also acts as a countermeasure to automation bias and should include attention to grit and positive growth signals (not just potential application “red flags” or “blemishes”). The AAMC has several helpful resources for residency programs on holistically considering the “whole” applicant.[9],[10]
As you make use of this tool, please use caution when comparing grades for applicants across years or schools. Even with perfect grade normalization, applicant performance would still not be fully comparable, for several reasons including:
- Medical school curricula and overall educational philosophies are non-standardized, including within and across allopathic, osteopathic and international medical schools.
- Some medical schools tend to award honors more liberally, while others use a simple pass/fail system, making it difficult to create a uniform standard for comparison.
- Some medical schools have gone to clustered grading (i.e. combining multiple core clerkships into a single grade).
- Grades are determined at medical schools using variable criteria, as well as a diversity of evaluator types, roles, quantity, and frequency. Shelf and other exam weighting, and other elements of final grades vary throughout.
- The overlap (or lack thereof) of information that is displayed on a medical school transcript vs. medical school performance evaluation at any institution varies.
- Medical students are usually randomly assigned to a sub-cohort determining when they complete any core clerkship (from the first to the last clinical rotation), and are therefore usually evaluated against a sub-cohort of their peers rather than the entirety of their class.
- Additional variability may be introduced for students pursuing combined/multiple degrees or taking other breaks in education.
- Core clerkships are offered over a variety of practice environments and geographies, even at the same medical school.
International medical graduates (IMGs) face additional methodological limitations. While Thalamus builds with a focus on, and commitment to, product inclusivity, it was not possible to accurately provide normalized transcripts for many international schools for the 2025 cycle due to the limitations listed below. The initial population of this study included international medical schools whose applicants met the frequency threshold described in the methods section. This prototype emphasized accuracy, so the focus was on schools where accurate results could be provided with high confidence. As there are thousands of medical schools globally, additional work is underway to incorporate more schools as the project expands. This is why a “View Transcript” link appears next to any grade we cannot aggregate, display, or normalize. We strongly encourage program directors to consider all IMGs holistically, including those whose transcripts we are unable to normalize due to factors outside of the student’s control, which may include:
- International medical schools often reflect even greater diversity in grading schema due to differing lengths of programs, non-standardized curricula, and variations in course content across years.
- IMG transcripts are more likely to have lower resolution (e.g., low-quality photocopies, backgrounds distorting grade visibility, or handwritten transcripts that are difficult to read even by the human eye), unconventional orientations (such as diagonal or upside-down scans), and language translation issues, which can result in errors, misinterpretations, and non-standard processing even for students applying from the same school during the same academic year.
- IMGs are also more likely than U.S. graduates, who are typically upcoming or recent graduates, to be reapplicants who graduated several years ago, sometimes 5 to 10 years prior. This means that, statistically, fewer IMG transcripts will be available for any given academic year, leading to smaller sample sizes that will result in smaller grade distributions, or no distribution at all if falling below the minimum threshold described above.
These limitations highlight the inherent challenges of normalization across diverse educational backgrounds and grading systems, underscoring the need for ongoing refinement and consideration of these factors in the transcript normalization process. In other words, while grade normalization provides a more “apples to apples” comparison (vs. the current status quo of “apples to oranges”), these efforts would remain imperfect even if this new technology performed with 100% accuracy at every medical school worldwide. Regardless, Thalamus and the AAMC will continue to work through perceived, relative, and absolute limitations in this methodology, innovating based on user feedback as well as continued advances in the development of the tool and the broader technology at large.
VI. Conclusion
This novel transcript normalization tool in Thalamus Cortex is a prototype released in beta, designed with a focus on product inclusivity, accessibility, transparency and equity. By leveraging a combination of manual review, OCR technology, and advanced machine learning and large language models, we have developed a prototype that aims to enhance the fairness of and efficiency within the residency application process. We acknowledge the complexity of this task and the importance of maintaining trust with residency program directors and other stakeholders. As such, we remain committed to ongoing refinement and improvement of this tool, informed by user feedback and continuous evaluation.
We invite residency program directors and other stakeholders to provide feedback on this beta release to help us refine and enhance the tool further. Our goal is to support you in making well-informed, equitable decisions about your residency applicants to streamline and optimize match outcomes.
[1] https://www.usmle.org/usmle-step-1-transition-passfail-only-score-reporting
[2] https://www.nbome.org/news/comlex-usa-level-1-to-eliminate-numeric-scores/
[3] Ozair A, Bhat V, Detchou DKE. The US Residency Selection Process After the United States Medical Licensing Examination Step 1 Pass/Fail Change: Overview for Applicants and Educators. JMIR Med Educ. 2023 Jan 6;9:e37069. doi: 10.2196/37069. PMID: 36607718; PMCID: PMC9862334.
[4] https://www.aamc.org/career-development/affinity-groups/gsa/medical-student-performance-evaluation
[5] https://www.aamc.org/data-reports/curriculum-reports/data/clerkship-requirements-discipline
[6] Jeremy M. Lipman, Colleen Y. Colbert, Rendell Ashton, Judith French, Christine Warren, Monica Yepes-Rios, Rachel S. King, S. Beth Bierer, Theresa Kline, James K. Stoller; A Systematic Review of Metrics Utilized in the Selection and Prediction of Future Performance of Residents in the United States. J Grad Med Educ 1 December 2023; 15 (6): 652–668. doi: https://doi.org/10.4300/JGME-D-22-00955.1
[7] https://thalamusgme.com/thalamus-first-medical-education-technology-company-to-convert-to-a-public-benefit-corporation/
[8] https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
[9] https://www.aamc.org/services/member-capacity-building/holistic-review
[10] https://www.mededportal.org/doi/10.15766/mep_2374-8265.11299