Conceptual Overview: The Living Review Pipeline
conceptual_overview.RmdIntroduction: Towards Living Reviews
Systematic reviews and meta-analyses are crucial for synthesizing evidence, but they are time-consuming to create and quickly become outdated. A “living review” aims to continuously incorporate new evidence as it becomes available. This project provides a framework and tools to build an AI-assisted pipeline for creating such living reviews, focusing initially on serum cytokine data during human pregnancy.
The pipeline consists of two core R packages:
-
metawoRld: The foundation for storing, managing, and presenting the structured review data. -
DataFindR: The AI-powered engine for finding, assessing, and extracting data from scientific literature to populate themetawoRldproject.
The metawoRld Package: Structure and Presentation
metawoRld provides the structured environment for the
living review. Its key roles are:
-
Project Initialization: Creates a standardized
project directory structure (
create_metawoRld). -
Schema Definition: Defines the precise data
structure expected for metadata and quantitative results in
_metawoRld.yml. This includes fields, required/optional status, and expected formats (e.g., for outcome groups, measurement methods). -
Data Storage: Stores data for each included study
in a dedicated subdirectory (
data/study_id/) using human-readable and version-control-friendly files:-
metadata.yml: Study design, population, methods, group definitions. -
data.csv: Quantitative cytokine measurements, linked back to metadata.
-
-
Data Validation: Includes functions to validate
incoming data against the project schema (
add_study_data,validate_study,validate_world). -
Data Loading: Provides functions to load the
structured data back into R for analysis
(
load_metawoRld). - Presentation: Includes templates and functions to generate a Quarto website, providing a browsable overview of the included studies and their key findings.
Essentially, metawoRld ensures the data is consistently
structured, validated, and accessible for both analysis and
presentation.
The DataFindR Package: AI-Assisted Data Collection
DataFindR leverages Large Language Models (LLMs) to
automate parts of the literature screening and data extraction process.
It interacts closely with a metawoRld project:
- Input Acquisition: Takes study identifiers (DOIs, PMIDs) as input.
-
Metadata Fetching: Retrieves abstracts and
bibliographic details from online sources
(
df_fetch_metadata). -
Relevance Assessment: Uses an LLM, guided by the
inclusion/exclusion criteria defined in the
metawoRldproject’s_metawoRld.yml, to assess the relevance of a study based on its title and abstract (df_assess_relevance). Provides a score and rationale. Includes an interactive Shiny app (df_shiny_assess) for this step. -
Data Extraction: For relevant papers, uses an LLM,
guided by the detailed data
schemafrom_metawoRld.yml, to extract specific data points (cytokine levels, group sizes, methods, etc.) from the full text (PDF or text). -
Prompt Generation: Creates detailed prompts for the
LLM based on the
metawoRldschema (df_generate_assessment_prompt,df_generate_extraction_prompt). - Output Formatting: Aims to receive structured JSON output from the LLM.
-
Caching: Caches results from metadata fetching and
LLM calls within the
metawoRldproject (.metawoRld_cache/datafindr/) to avoid redundant work and API costs, facilitating collaboration. -
Integration: Parses the extracted data and uses
metawoRld::add_study_data()to save it into the structured project.
DataFindR acts as the intelligent agent feeding
validated, structured data into the metawoRld
repository.
Overall Workflow
The typical workflow using these packages looks like this:
#> Install DiagrammeR to see the workflow diagram.
-
Setup: Create a
metawoRldproject, defining the scope, criteria, and desired data schema in_metawoRld.yml. -
Screening: Use
DataFindR(df_assess_relevanceordf_shiny_assess) to process a list of DOIs/PMIDs, leveraging LLMs and project criteria to identify potentially relevant papers. -
Extraction: For relevant papers, obtain the full
text and use
DataFindR’s extraction capabilities (future function, e.g.,df_extract_from_file) to pull out data according to themetawoRldschema. -
Saving: The extracted and validated data is saved
into the
metawoRldproject structure usingmetawoRld::add_study_data. -
Presentation: Re-render the
metawoRldQuarto website (quarto::quarto_render()) to include the latest studies. -
Analysis: Use
metawoRld::load_metawoRld()to load all data for custom analysis or meta-analysis.
Interaction Between Packages
-
Schema:
metawoRlddefines the schema;DataFindRreads it to generate prompts and structure its output. -
Criteria:
metawoRldstores inclusion/exclusion criteria;DataFindRreads them for the assessment prompt. -
Data Flow:
DataFindRextracts data;metawoRldvalidates and stores it. -
Caching:
DataFindRwrites its cache within the activemetawoRldproject directory for persistence and collaboration.
This interconnected design allows for a modular yet integrated approach to building and maintaining a living review dataset.