Extract Data for a Batch of Studies using LLM
df_extract_batch.Rd
Runs the LLM data extraction workflow for multiple studies, taking paths to
full-text files as input. Saves the raw JSON output from the LLM to the
extraction cache within the metawoRld project. Does not import the results
into metawoRld; use df_import_batch subsequently to import cached results.
Usage
df_extract_batch(
  chat,
  identifiers,
  paper_paths,
  metawoRld_path,
  force_extract = FALSE,
  stop_on_error = FALSE,
  ellmer_timeout_s = 300,
  ...
)
Arguments
- chat
An ellmer chat object used to run the extraction (e.g., created with ellmer::chat_openai()).
- identifiers
Character vector. A vector of DOIs and/or PMIDs for studies identified as relevant (e.g., having an "Include" assessment decision).
- paper_paths
Character vector. Named vector or list where names are the identifiers (matching the identifiers argument) and values are the file paths to the corresponding full-text plain text (.txt) files.
- metawoRld_path
Character string. Path to the root of the metawoRld project.
- force_extract
Logical. If TRUE, bypass the extraction cache and re-run LLM extraction even if cached JSON exists. Defaults to FALSE.
- stop_on_error
Logical. If TRUE, stop the batch if any single extraction fails. Defaults to FALSE.
- ellmer_timeout_s
Numeric. Timeout in seconds for each LLM API call. Defaults to 300.
- ...
Additional arguments passed down to the LLM API call function (e.g., temperature, max_tokens).
- service
Character string. The LLM service to use (e.g., "openai").
- model
Character string. The specific LLM model name (e.g., "gpt-4-turbo").
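One way to build the named paper_paths vector is with base R's setNames(); this sketch assumes the full-text files live under a papers/ directory and are named after their identifiers (the identifiers, directory, and naming scheme are illustrative):

```r
# Illustrative DOI and PMID; names must match the identifiers argument
ids <- c("10.1234/example.doi", "12345678")

# Slashes in DOIs are not valid in file names, so replace them here
paper_paths <- setNames(
  file.path("papers", paste0(gsub("/", "_", ids, fixed = TRUE), ".txt")),
  ids
)

# Optional sanity check before running the batch:
# stopifnot(all(file.exists(paper_paths)))
```

Identifiers without a matching entry in paper_paths would be reported with status "Skipped" in the returned summary.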
Value
A data frame (tibble) summarizing the extraction attempt for each identifier, with columns:
identifier
The DOI or PMID.
status
"Success" (LLM ran, JSON saved), "Cached" (used existing cache), "Skipped" (e.g., missing paper path), or "Failure".
cache_file
Path to the saved/checked cache file if status is "Success" or "Cached".
error_message
The error message if status is "Failure".
Also prints progress and summary information.
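A sketch of a typical batch run, assuming an ellmer chat object and a metawoRld project at my_project/; the identifiers, file paths, model name, and the df_import_batch call are illustrative rather than a confirmed interface:

```r
# Configure the LLM via ellmer (model name is an example)
chat <- ellmer::chat_openai(model = "gpt-4-turbo")

ids   <- c("10.1234/example.doi", "12345678")
paths <- setNames(file.path("papers", c("study1.txt", "study2.txt")), ids)

res <- df_extract_batch(
  chat,
  identifiers    = ids,
  paper_paths    = paths,
  metawoRld_path = "my_project"
)

# Inspect any failures before importing the cached JSON
subset(res, status == "Failure")

# Then import successfully cached results into the metawoRld project
# (hypothetical call; see df_import_batch for its actual arguments)
df_import_batch(
  identifiers    = res$identifier[res$status %in% c("Success", "Cached")],
  metawoRld_path = "my_project"
)
```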