AI-Augmented Multimodal Risk Prediction of Breast Cancer Using Polygenic Scores, Lifestyle Data, and Temporal EMR Modeling

Anjali Desai

Abstract

This paper proposes a multimodal Artificial Intelligence (AI) system that predicts personalized breast cancer risk by integrating polygenic risk scores (PRS), lifestyle factors, and electronic medical records (EMRs). Unlike diagnostic methods that rely heavily on imaging, the system estimates lifetime or 10-year risk probabilities to guide proactive, preventive care. The architecture combines structured data (e.g., SNP-based polygenic scores, BMI, hormone therapy history) with unstructured text (e.g., clinical notes, family history), using TabNet for the structured inputs and BioBERT for the text. To capture how a patient's health changes over time, time-aware transformers process longitudinal EMR sequences and identify dependencies across decades of medical history. To support clinical trust and utility, predictions are explained with SHAP (Shapley Additive Explanations) and counterfactual reasoning. This approach has the potential to augment traditional risk calculators such as the Tyrer-Cuzick and BOADICEA models, supporting more accurate, equitable, and timely early intervention across populations.

1. Introduction

1.1 Background

Breast cancer remains a leading cause of cancer-related mortality among women globally. Early and accurate risk prediction is paramount, as it enables a shift from reactive treatment to proactive prevention. By identifying individuals at elevated risk before the disease manifests, clinicians can recommend personalized strategies, such as tailored screening schedules, lifestyle modifications, or preventive therapies.

Established tools, such as the Tyrer-Cuzick (IBIS) and BOADICEA models, are currently used to estimate a woman’s risk. These calculators primarily rely on factors like age, body mass index (BMI), hormonal history, and basic family history. While valuable, these models do not capture the complete risk landscape. They often overlook the wealth of information contained in a patient's genome, specifically polygenic risk scores (PRS), which aggregate the small effects of numerous DNA variants. Furthermore, they fail to leverage the rich, longitudinal data within electronic medical records (EMRs), including unstructured clinical notes and temporal patterns in lab results.

Recent research underscores the predictive value of integrating these missing data sources. A PRS can distill information from hundreds or thousands of genetic variants into a single, actionable metric of inherited risk. Incorporating PRS and mammographic breast density into the Tyrer-Cuzick model has been reported to raise the Area Under the Curve (AUC) from 0.58 to 0.67. This is a meaningful gain in distinguishing between individuals who will and will not develop breast cancer, although an AUC of 0.67 still leaves considerable room for improvement.

Lifestyle factors and clinical history, such as BMI, age at menarche or menopause, hormone therapy usage, and alcohol consumption, are also critical risk determinants. However, most models treat these factors as static snapshots rather than as variables that change over time. EMRs provide a detailed timeline of this evolution, containing not only structured data but also invaluable unstructured notes from clinicians (e.g., “mother diagnosed with breast cancer at 42” or “patient has dense breast tissue”). This contextual information is vital for a holistic assessment but is typically ignored by traditional risk calculators.

By designing an integrated system that fuses structured genomic and clinical data with unstructured notes and temporal health patterns, we can construct a far more comprehensive and personalized understanding of breast cancer risk. This paper outlines the architecture of such a system and explores its potential to be a practical, interpretable, and powerful tool in preventive medicine.

1.2 Challenges with Current Approaches

While models like Tyrer-Cuzick and BOADICEA are foundational, they have inherent limitations. Their reliance on self-reported or manually entered data makes them susceptible to recall bias and incompleteness. Critically, in routine use they are often run without polygenic data, so genetic input is limited to known high-penetrance mutations such as BRCA1 or BRCA2. This creates a significant blind spot: an individual with no known family history but a high PRS could be incorrectly classified as low-risk, missing a crucial window for early intervention.

A pressing concern is fairness and equity. Much of the foundational data used to develop and validate these models originates from populations of European ancestry. Consequently, their predictive accuracy often diminishes when applied to women from diverse racial, ethnic, or geographic backgrounds. Polygenic scores, in particular, are known to perform best for individuals of European descent and to show lower predictive accuracy in other populations, because those populations were underrepresented in the initial genome-wide association studies (GWAS).

Another major limitation is the static treatment of a patient's medical history. Health is a dynamic process, yet these models assess risk at a single point in time, ignoring the cumulative impact of health changes over a lifetime. EMRs document this temporal journey—chronicling diagnoses, lab trends, and prescription histories—but this dynamic dimension is largely untapped. Unstructured clinical notes, which often contain crucial details about family history or mammographic findings, are similarly overlooked due to the complexity of natural language processing.

This paper proposes a more sophisticated approach—one that harmonizes these disparate data sources. By integrating structured data, longitudinal EMRs, and free-text narratives, we can achieve a more nuanced and accurate risk profile. The subsequent sections detail how this system can be designed to overcome current limitations and empower clinicians with smarter, earlier decision-making capabilities.

2. Proposed Multimodal Risk Prediction System

The system's core objective is to deliver a holistic and personalized breast cancer risk prediction by unifying a patient's genetic profile, clinical and lifestyle factors, and their complete medical history from EMRs. This is achieved through a multi-branched architecture where diverse data types are processed in parallel before being integrated into a final predictive framework.

2.1 Genetic Risk via Polygenic Risk Scores (PRS)

A polygenic risk score (PRS) quantifies an individual's inherited genetic susceptibility by aggregating the effects of numerous common DNA variations known as single nucleotide polymorphisms (SNPs). While high-impact mutations like BRCA1 are rare, the cumulative effect of many low-impact variants is a primary driver of risk for the general population. The PRS is calculated as a weighted sum of these risk-associated SNPs, providing a continuous measure of genetic liability.
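The weighted sum described above can be sketched in a few lines. This is a minimal illustration, not the scoring pipeline of any particular PRS: the SNP identifiers, the effect sizes, and the choice to skip ungenotyped SNPs are all assumptions made for the example. Each weight stands for a per-allele log odds ratio of the kind reported by a GWAS, and each dosage is the number of risk alleles carried (0, 1, or 2).

```python
# Hypothetical per-allele effect sizes (log odds ratios).
# Real weights would come from a published GWAS, not from this sketch.
SNP_WEIGHTS = {
    "snp_A": 0.12,
    "snp_B": 0.05,
    "snp_C": -0.03,
}


def polygenic_risk_score(dosages, weights=SNP_WEIGHTS):
    """Weighted sum of risk-allele counts (0, 1, or 2) over the scored SNPs."""
    score = 0.0
    for snp, beta in weights.items():
        dosage = dosages.get(snp)
        if dosage is None:
            # Assumption: an ungenotyped SNP is skipped here;
            # a real pipeline would more likely impute it.
            continue
        if not 0 <= dosage <= 2:
            raise ValueError(f"dosage for {snp} must be between 0 and 2, got {dosage}")
        score += beta * dosage
    return score


# Hypothetical genotype: two risk alleles at snp_A, one at snp_B, none at snp_C.
patient = {"snp_A": 2, "snp_B": 1, "snp_C": 0}
print(round(polygenic_risk_score(patient), 4))
```

In practice the raw score is usually standardized against a reference population before it is used as a model input, which is one place where the ancestry of the reference data matters.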

In our system, the PRS serves as a foundational input. It is particularly crucial for identifying women who may have a high genetic predisposition despite a non-indicative family history. Studies have demonstrated that incorporating PRS dramatically improves risk stratification. For instance, women in the top 1% of the PRS distribution have been found to have a lifetime risk several times higher than those with an average PRS, even when other clinical factors appear normal.

2.2 Lifestyle and Clinical Risk Factors

An individual’s lifestyle and clinical history are potent modifiers of their baseline genetic risk. The system processes a range of structured data points, including:

Body Mass Index (BMI)
Reproductive History: Age at menarche, parity (number of children), age at menopause.
Hormone Therapy Use: History of Hormone Replacement Therapy (HRT) or oral contraceptives.
Lifestyle Factors: Documented alcohol consumption and physical activity levels.
Clinical History: Personal or family history of benign breast conditions.
Breast Density: Mammographic density measurements, when available.

Unlike conventional models that treat these factors in isolation, our system is designed to learn the complex, non-linear interactions between them. For example, the combined effect of a high BMI and long-term hormone therapy in a person with a moderate PRS might elevate their risk far more than the sum of the individual factors would suggest.

2.3 Longitudinal EMR Data

EMRs contain a rich temporal narrative of a patient’s health journey, including diagnoses, lab results, medications, and procedures. Most risk models ignore this timeline, treating risk as a static calculation. Our system employs a time-aware sequence model to analyze how a patient's health profile evolves over months, years, or even decades.

This dynamic approach allows the model to recognize clinically meaningful patterns. For example, a record showing a benign biopsy five years prior, followed by steady weight gain and the recent initiation of hormone therapy, is interpreted as a sequential progression of risk factors. This provides a more realistic depiction of risk as a cumulative process. Each entry in the patient's EMR is treated as a time-stamped event, which is fed into a model that learns to recognize long-term patterns indicative of increasing or decreasing risk.

2.4 Clinical Notes and Unstructured Text

Clinicians often record vital information in narrative text, such as "patient's maternal aunt was diagnosed with breast cancer at 38" or "dense fibroglandular tissue noted on last mammogram." This unstructured data is rich with context but is typically inaccessible to standard analytical models.

To unlock this information, the system utilizes a biomedical language model (like BioBERT), pre-trained on vast archives of medical literature and clinical documents. This enables it to comprehend complex medical terminology and extract key entities and relationships from free-text notes. The extracted information is then converted into a structured format (a numerical vector) that the predictive model can use, ensuring that qualitative observations contribute meaningfully to the final risk assessment.

3. Model Design and Interpretability

The fusion of genetics, structured clinical data, unstructured text, and temporal health records necessitates a sophisticated and modular architecture. Each data type is processed by a specialized model, and the outputs are intelligently combined to generate a final, interpretable prediction.

3.1 Structured Data with TabNet

For structured inputs (PRS, BMI, hormone history), we employ TabNet, a deep learning model designed specifically for tabular data. Unlike traditional tree-based models, TabNet uses a sequential attention mechanism to select the most salient features for each prediction. This allows it to learn complex relationships while also providing insight into its decision-making process. This built-in feature selection is a major advantage, as it mimics how a clinician might weigh different factors. After a prediction, we can visualize which features TabNet prioritized, revealing, for example, that a patient's high PRS and age at menopause were the dominant drivers of their risk score.
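To make the feature-selection idea concrete, the sketch below implements sparsemax, the normalization that the original TabNet paper uses to turn attention scores into a sparse feature mask. It shows only that one step, not a full TabNet, and the feature names and scores are hypothetical. Unlike softmax, sparsemax can assign exactly zero weight to low-scoring features, which is what makes the resulting mask readable as "these features were used, these were not."

```python
def sparsemax(scores):
    """Project scores onto the probability simplex; low scores become exactly 0."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k-th largest score stays in the support while
        # 1 + k * z_(k) exceeds the running sum of the top k scores.
        if 1.0 + k * value > cumulative:
            support_size = k
            support_sum = cumulative
    threshold = (support_sum - 1.0) / support_size
    return [max(s - threshold, 0.0) for s in scores]


# Hypothetical attention scores for three structured features.
feature_names = ["prs", "age_at_menopause", "bmi"]
mask = sparsemax([1.0, 0.8, 0.1])
for name, weight in zip(feature_names, mask):
    print(f"{name}: {weight:.2f}")
```

With these scores the mask puts all of its weight on the first two features and none on the third, so the step's decision can be attributed to PRS and age at menopause alone.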

3.2 Unstructured Text with Biomedical Language Models

The system leverages a biomedical language model (e.g., BioBERT), which is pre-trained on large biomedical corpora, to process unstructured text from clinical notes. These models excel at understanding medical context, terminology, and relationships between concepts. The model transforms a clinical narrative into a dense vector—a numerical representation that encapsulates its semantic meaning. This vector captures critical information, such as family history details or prior biopsy results, and feeds it into the final fusion layer for a comprehensive risk assessment.

3.3 Modeling Time with Temporal Sequences

To model the evolution of a patient's health from EMR data, the system uses a time-aware transformer or a similar temporal sequence model. This architecture processes a patient's medical history as a series of time-stamped events (diagnoses, medications, lab results). Crucially, the model not only considers the order of events but also the duration between them, allowing it to learn the importance of patterns like rapid health changes, long-term trends, or extended gaps in care. The model processes the patient's entire timeline, learns which events are most predictive, and creates a summary representation of their health journey to inform the final risk prediction.
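One way to expose both the order of events and the time between them is sketched below. The event codes and dates are hypothetical, and the sinusoidal encoding is borrowed from standard transformer positional encodings and applied to elapsed days; the paper does not prescribe a specific time encoding, so treat this as one plausible choice rather than the system's design.

```python
import math
from datetime import date

# Hypothetical event stream for one patient: (date, event code).
events = [
    (date(2021, 6, 15), "bmi_30_plus"),
    (date(2018, 3, 1), "benign_biopsy"),
    (date(2023, 1, 10), "hrt_started"),
]


def time_features(events, prediction_date):
    """Return (code, days since previous event, days before prediction), in time order."""
    ordered = sorted(events, key=lambda e: e[0])
    rows = []
    previous = None
    for when, code in ordered:
        gap = 0 if previous is None else (when - previous).days
        rows.append((code, gap, (prediction_date - when).days))
        previous = when
    return rows


def sinusoidal_time_encoding(days, dim=4, max_period=10000.0):
    """Encode elapsed days as sin/cos pairs at geometrically spaced frequencies."""
    encoding = []
    for i in range(dim // 2):
        frequency = 1.0 / (max_period ** (2 * i / dim))
        encoding.append(math.sin(days * frequency))
        encoding.append(math.cos(days * frequency))
    return encoding


rows = time_features(events, prediction_date=date(2024, 1, 1))
for code, gap_days, days_before in rows:
    print(code, gap_days, days_before, sinusoidal_time_encoding(gap_days))
```

Each event's code embedding would then be combined with its time encoding before entering the sequence model, so that a long gap in care and a rapid succession of events produce different inputs even when the event codes are identical.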

3.4 Fusion of All Modalities

After each data stream is individually processed, the outputs from TabNet, the language model, and the temporal model are combined through data fusion. A common and effective technique is to concatenate these intermediate representations and pass them through a final set of neural network layers. This ensemble approach allows the model to learn cross-modal interactions. For instance, it can discover that a patient with a moderate PRS who has a family history mentioned in clinical notes and exhibits recent weight gain in their EMR timeline has a significantly higher risk than any single modality would suggest. This holistic fusion ensures that no piece of information is overlooked, whether it originates from a genetic test, a structured form, or a physician's narrative.
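The concatenation approach can be sketched as follows. The three input vectors stand in for the outputs of TabNet, the language model, and the temporal model; their sizes and every weight below are made-up values for illustration, whereas a real fusion head would learn its weights during training. A hidden layer with a non-linearity is included because a purely linear head over concatenated inputs cannot represent interactions between modalities.

```python
import math


def fuse(tabular_vec, text_vec, temporal_vec):
    """Late fusion by concatenation: one flat vector holding all three representations."""
    return list(tabular_vec) + list(text_vec) + list(temporal_vec)


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [b + sum(w * x for w, x in zip(row, inputs)) for row, b in zip(weights, biases)]


def predict_risk(fused, hidden_w, hidden_b, out_w, out_b):
    """Tiny fusion head: ReLU hidden layer followed by a sigmoid output."""
    hidden = [max(0.0, h) for h in dense(fused, hidden_w, hidden_b)]
    logit = dense(hidden, [out_w], [out_b])[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical branch outputs (real embeddings would be much larger).
fused = fuse(tabular_vec=[0.8, 0.1], text_vec=[0.5], temporal_vec=[0.3, -0.2])

# Hypothetical weights: 2 hidden units over the 5 fused inputs.
hidden_w = [
    [0.6, -0.2, 0.4, 0.3, 0.1],
    [-0.3, 0.5, 0.2, -0.1, 0.4],
]
hidden_b = [0.0, 0.1]
out_w = [0.9, -0.4]
out_b = -1.5

print(f"predicted risk: {predict_risk(fused, hidden_w, hidden_b, out_w, out_b):.3f}")
```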

4. Interpreting Predictions

In clinical medicine, a "black box" prediction is insufficient. For doctors and patients to trust and act on an AI-driven risk score, they must understand the reasoning behind it. Therefore, our system integrates two powerful interpretability tools: feature attribution and counterfactual reasoning.

4.1 Understanding What Contributes to Risk (SHAP Values)

Once the system generates a risk score, it uses SHAP (Shapley Additive Explanations) to deconstruct the prediction. SHAP assigns an importance value to every input feature, quantifying how much it contributed to pushing the final risk score up or down from a baseline average.

For a patient predicted to have a 21% 10-year risk, the SHAP breakdown might be visualized as:

Baseline Risk: 10%
Factors Increasing Risk:
- Polygenic Risk Score (High): +7%
- High BMI: +4%
- Family History (from notes): +3%
Factors Decreasing Risk:
- No History of Hormone Therapy: -2%
- Age at First Childbirth (later): -1%
Final Predicted Risk: 21%

This transparent breakdown helps clinicians pinpoint the primary drivers of risk, facilitating a more focused and effective patient consultation.
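The additive property behind such a breakdown (baseline plus contributions equals the prediction) can be demonstrated by computing exact Shapley values for a toy model. This is a sketch under strong assumptions: the risk function and its numbers are invented, there are only three binary features, and a single all-absent reference patient stands in for the baseline. The SHAP library itself averages over a background dataset and uses approximations for larger models.

```python
from itertools import permutations


def toy_risk(features):
    """Hypothetical risk in percent: additive terms plus one PRS x BMI interaction."""
    risk = 10.0
    if features["high_prs"]:
        risk += 6.0
    if features["high_bmi"]:
        risk += 3.0
    if features["family_history"]:
        risk += 2.0
    if features["high_prs"] and features["high_bmi"]:
        risk += 2.0
    return risk


def shapley_values(model, patient, reference):
    """Average each feature's marginal contribution over all feature orderings."""
    names = list(patient)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(reference)
        previous = model(current)
        for name in order:
            current[name] = patient[name]  # switch this feature to the patient's value
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


patient = {"high_prs": True, "high_bmi": True, "family_history": True}
reference = {"high_prs": False, "high_bmi": False, "family_history": False}

contributions = shapley_values(toy_risk, patient, reference)
baseline = toy_risk(reference)
print(baseline, contributions, baseline + sum(contributions.values()))
```

Note how the interaction term is split evenly between PRS and BMI: neither feature "owns" the joint effect, so each receives half of it on top of its own additive term.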

4.2 Exploring "What-If" Scenarios (Counterfactuals)

Counterfactual reasoning provides an interactive way to explore how risk could change if certain factors were different. The system can answer critical "what-if" questions, such as:

"What would the patient's risk be if their BMI were reduced from 32 to 26?"
"How would the risk change if they had not undergone hormone therapy?"
"What is the risk contribution of the family history component alone?"

By simulating these modifications and re-calculating the prediction, the system provides a quantitative estimate of the impact of modifiable risk factors. For example, showing that lowering BMI could reduce a patient's 10-year risk by 5 percentage points provides a powerful, data-driven motivation for lifestyle changes. Together, SHAP and counterfactuals transform the system from a static calculator into a dynamic decision-support tool.
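A minimal sketch of this re-scoring loop is shown below. A small logistic model with invented coefficients stands in for the trained multimodal system, and the feature names are hypothetical; the point is only the mechanics of overriding one input, holding the rest fixed, and comparing the two predictions.

```python
import math

# Hypothetical logistic coefficients; a real system would query the trained model instead.
COEFFICIENTS = {"bmi": 0.04, "hrt_years": 0.06, "prs_z": 0.45}
INTERCEPT = -4.0


def predicted_risk(patient):
    """Stand-in risk model: logistic function of a weighted sum of the inputs."""
    logit = INTERCEPT + sum(COEFFICIENTS[name] * patient[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))


def counterfactual_delta(patient, changes):
    """Change in predicted risk when the listed features are overridden, others held fixed."""
    modified = {**patient, **changes}  # copy, so the original record is untouched
    return predicted_risk(modified) - predicted_risk(patient)


patient = {"bmi": 32.0, "hrt_years": 5.0, "prs_z": 1.2}

print(f"current risk:        {predicted_risk(patient):.3f}")
print(f"if BMI were 26:      {counterfactual_delta(patient, {'bmi': 26.0}):+.3f}")
print(f"if no HRT was taken: {counterfactual_delta(patient, {'hrt_years': 0.0}):+.3f}")
```

The resulting differences describe how the model's output responds to the change, which reflects associations learned from data and is not by itself an estimate of the causal effect of the intervention.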

5. Potential Benefits, Challenges, and Solutions

5.1 Why This Model Could Be Better

The proposed system offers several transformative advantages over existing risk assessment tools:

Holistic Data Integration: It leverages a richer, more diverse set of inputs—genomics, clinical data, unstructured text, and temporal patterns—for a truly comprehensive assessment.
Dynamic Personalization: Each risk score is deeply personalized and can be updated dynamically as new information is added to a patient's EMR, reflecting their evolving health status.
Enhanced Equity: With deliberate training on diverse, multi-ancestral datasets and continuous monitoring with fairness metrics, the model can be designed to provide accurate predictions across different populations, mitigating existing biases.
Proactive and Interpretable: The system not only predicts risk but also explains its reasoning and allows for the exploration of preventive "what-if" scenarios, empowering both clinicians and patients.

5.2 Key Challenges and How to Handle Them

Data Quality and Availability: EMR data can be incomplete or inconsistent. Not all patients will have genetic testing or structured family history records.
- Solution: The model must be robust to missing data, using available information to make the best possible prediction. It should also be able to flag areas where additional data (e.g., a genetic test) could significantly improve the accuracy of the risk assessment.
Bias and Fairness: As noted, PRS and other clinical models often exhibit performance disparities across different ancestral groups.
- Solution: Addressing this requires a multi-pronged strategy: actively curating diverse training datasets, using advanced techniques like transfer learning to adapt models to underrepresented populations, and embedding fairness metrics directly into the model validation process to ensure equitable performance.
Integration into Clinical Workflows: Introducing a new tool into a busy clinical environment can be disruptive.
- Solution: The system should be designed for seamless integration within existing EMR platforms. Outputs must be presented in a simple, intuitive, and visual format that is easy for clinicians to interpret and communicate to patients.
Privacy and Security: Combining sensitive genomic and health data raises significant privacy concerns.
- Solution: The system must adhere to all relevant data privacy regulations (like HIPAA). Techniques such as federated learning, where the model is trained on decentralized data without moving it, and strong data encryption are essential to protect patient confidentiality. Clear patient consent is a mandatory prerequisite.
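The fairness monitoring mentioned under Bias and Fairness can be made concrete by reporting discrimination separately for each ancestry group instead of as one pooled figure. The sketch below computes AUC per group directly from its definition; the group labels, outcomes, and scores are fabricated toy values used only to exercise the code, not results.

```python
def auc(labels, scores):
    """Probability that a random case outscores a random non-case (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # AUC is undefined without both classes
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def auc_by_group(records):
    """records: iterable of (group, label, score). Returns {group: AUC or None}."""
    grouped = {}
    for group, label, score in records:
        labels, scores = grouped.setdefault(group, ([], []))
        labels.append(label)
        scores.append(score)
    return {group: auc(labels, scores) for group, (labels, scores) in grouped.items()}


# Toy validation records: (ancestry group, developed cancer?, predicted risk).
records = [
    ("group_a", 1, 0.9), ("group_a", 0, 0.2), ("group_a", 1, 0.7), ("group_a", 0, 0.4),
    ("group_b", 1, 0.6), ("group_b", 0, 0.7), ("group_b", 1, 0.8), ("group_b", 0, 0.3),
]

per_group = auc_by_group(records)
print(per_group)
```

A validation report could flag any group whose AUC falls below the pooled value by more than an agreed margin, which turns the equity goal into a check that can block a model release.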

5.3 Looking Ahead: Future Extensions

While this paper focuses on breast cancer, the multimodal framework is highly adaptable. It could be extended to predict risk for other complex diseases such as ovarian cancer, cardiovascular disease, or type 2 diabetes.

Future iterations could also incorporate imaging data. Integrating features from mammograms, ultrasounds, or MRIs could further enhance predictive accuracy by combining underlying risk factors with early radiological signs of disease.

Ultimately, this system could evolve into a comprehensive "risk dashboard" for patients and providers. This tool would provide regular updates on a person's health risks, offer clear explanations of the contributing factors, and suggest personalized, actionable steps to mitigate those risks, empowering individuals to take an active role in their long-term health management.

6. Conclusion

This paper outlines a next-generation framework for breast cancer risk prediction that moves beyond static, limited-variable models. By synergistically integrating polygenic scores, lifestyle factors, unstructured clinical notes, and longitudinal EMR data, the proposed AI system can generate a risk estimate that is more accurate, personalized, and comprehensive.

Unlike traditional calculators, this system is dynamic—it learns from new data, recognizes subtle health trends over time, and explains its predictions through interpretable tools like SHAP and counterfactuals. Its primary goal is to empower clinicians and patients with the foresight needed to make informed, timely decisions about preventive care.

While significant work is required for development, validation, and clinical integration, the potential is immense. Such a system could fundamentally shift the paradigm of cancer care from reacting to disease to proactively managing health, ultimately fostering an era of earlier, more equitable, and more effective prevention.

7. Bibliography

Roberts, M. C., et al. (2023). "Polygenic risk scores in breast cancer: A review of their potential for risk-stratified screening and prevention." Journal of Clinical Oncology. This review highlights that PRS can explain over 30% of breast cancer heritability and significantly enhance risk prediction.
Petitjean, H., et al. (2025). "Integration of Polygenic Risk Scores in Breast and Ovarian Cancer Risk Prediction Models." Journal of the National Cancer Institute. Found PRS to be the single most discriminative risk factor, achieving an AUC of approximately 0.65 when integrated into models.
Google Cloud Vertex AI. "TabNet on Vertex AI." Google Cloud Documentation. Details how TabNet provides scalable and interpretable model behavior with built-in feature importance insights, which is valuable for clinical integration. Available at: https://cloud.google.com/vertex-ai/docs/tabular-data/tabnet/overview
Lee, J., et al. (2019). "BioBERT: a pre-trained biomedical language representation model for biomedical text mining." arXiv preprint arXiv:1901.08746. BioBERT, pre-trained on large biomedical corpora, significantly outperforms standard BERT on biomedical tasks, making it ideal for extracting clinical data. Available at: https://arxiv.org/abs/1901.08746
Shang, X., et al. (2023). "Generalizability of a 313-variant polygenic risk score for breast cancer in diverse populations." medRxiv. This study assesses the performance of PRS313 across European, Latinx, African, and Asian American women, noting that AUC is lower in non-European populations and emphasizing the critical need for diverse datasets. Available at: https://www.medrxiv.org/content/10.1101/2023.03.10.23287042v1
