Extract Data for a Batch of Studies using LLM
df_extract_batch.Rd
Runs the LLM data extraction workflow for multiple studies, taking paths to
full-text files as input. Saves the raw JSON output from the LLM to the
extraction cache within the metawoRld project. Does not import the results
into metawoRld; use df_import_batch subsequently to import cached results.
Usage
df_extract_batch(
  chat,
  identifiers,
  paper_paths,
  metawoRld_path,
  force_extract = FALSE,
  stop_on_error = FALSE,
  ellmer_timeout_s = 300,
  ...
)
Arguments
- chat
An ellmer chat object used to run the extraction (e.g., created with ellmer::chat_openai()).
- identifiers
Character vector. A vector of DOIs and/or PMIDs for studies identified as relevant (e.g., having an "Include" assessment decision).
- paper_paths
Character vector. Named vector or list where names are the identifiers (matching the identifiers argument) and values are the file paths to the corresponding full-text plain text (.txt) files.
- metawoRld_path
Character string. Path to the root of the metawoRld project.
- force_extract
Logical. If TRUE, bypass the extraction cache and re-run LLM extraction even if cached JSON exists. Defaults to FALSE.
- stop_on_error
Logical. If TRUE, stop the batch if any single extraction fails. Defaults to FALSE.
- ellmer_timeout_s
Numeric. Timeout in seconds for each LLM API call. Defaults to 300.
- ...
Additional arguments passed down to the LLM API call function (e.g., temperature, max_tokens).
- service
Character string. The LLM service to use (e.g., "openai").
- model
Character string. The specific LLM model name (e.g., "gpt-4-turbo").
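One way to build the named paper_paths vector is with base R's setNames(); this sketch assumes the full-text files live under a papers/ directory and are named after their identifiers (the identifiers, directory, and naming scheme are illustrative):

```r
# Illustrative DOI and PMID; names must match the identifiers argument
ids <- c("10.1234/example.doi", "12345678")

# Slashes in DOIs are not valid in file names, so replace them here
paper_paths <- setNames(
  file.path("papers", paste0(gsub("/", "_", ids, fixed = TRUE), ".txt")),
  ids
)

# Optional sanity check before running the batch:
# stopifnot(all(file.exists(paper_paths)))
```

Identifiers without a matching entry in paper_paths would be reported with status "Skipped" in the returned summary.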
Value
A data frame (tibble) summarizing the extraction attempt for each identifier, with columns:
identifier
The DOI or PMID.
status
"Success" (LLM ran, JSON saved), "Cached" (used existing cache), "Skipped" (e.g., missing paper path), or "Failure".
cache_file
Path to the saved/checked cache file if status is "Success" or "Cached".
error_message
The error message if status is "Failure".
Also prints progress and summary information.
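A sketch of a typical batch run, assuming an ellmer chat object and a metawoRld project at my_project/; the identifiers, file paths, model name, and the df_import_batch call are illustrative rather than a confirmed interface:

```r
# Configure the LLM via ellmer (model name is an example)
chat <- ellmer::chat_openai(model = "gpt-4-turbo")

ids   <- c("10.1234/example.doi", "12345678")
paths <- setNames(file.path("papers", c("study1.txt", "study2.txt")), ids)

res <- df_extract_batch(
  chat,
  identifiers    = ids,
  paper_paths    = paths,
  metawoRld_path = "my_project"
)

# Inspect any failures before importing the cached JSON
subset(res, status == "Failure")

# Then import successfully cached results into the metawoRld project
# (hypothetical call; see df_import_batch for its actual arguments)
df_import_batch(
  identifiers    = res$identifier[res$status %in% c("Success", "Cached")],
  metawoRld_path = "my_project"
)
```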