
Runs the LLM data extraction workflow for multiple studies, taking paths to full-text files as input. Saves the raw JSON output from the LLM to the extraction cache within the metawoRld project; it does not import the results into metawoRld. Use df_import_batch afterwards to import the cached results.

Usage

df_extract_batch(
  chat,
  identifiers,
  paper_paths,
  metawoRld_path,
  force_extract = FALSE,
  stop_on_error = FALSE,
  ellmer_timeout_s = 300,
  ...
)

Arguments

chat

An ellmer chat object (e.g., created with ellmer::chat_openai()) used to perform the LLM extraction calls.

identifiers

Character vector. DOIs and/or PMIDs for studies identified as relevant (e.g., having an "Include" assessment decision).

paper_paths

Character vector. Named vector or list where names are the identifiers (matching identifiers argument) and values are the file paths to the corresponding full-text plain text (.txt) files.
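
For instance, a named character vector matching identifiers to full-text files might be built as follows (the identifiers and file paths here are purely illustrative):

```r
# Hypothetical identifiers and file locations -- adapt to your project layout
identifiers <- c("10.1000/j.example.2024.01.001", "38012345")

paper_paths <- c(
  "10.1000/j.example.2024.01.001" = "fulltexts/smith_2024.txt",
  "38012345"                      = "fulltexts/jones_2023.txt"
)
```

Each name must match an entry in identifiers exactly; identifiers without a corresponding path are reported as "Skipped" in the result.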

metawoRld_path

Character string. Path to the root of the metawoRld project.

force_extract

Logical. If TRUE, bypass the extraction cache and re-run LLM extraction even if cached JSON exists. Defaults to FALSE.

stop_on_error

Logical. If TRUE, stop the entire batch as soon as any single extraction fails. Defaults to FALSE.

ellmer_timeout_s

Numeric. Timeout in seconds applied to each ellmer LLM API call. Defaults to 300.

...

Additional arguments passed down to the LLM API call function (e.g., temperature, max_tokens). The arguments below are typically supplied this way:

service

Character string. The LLM service to use (e.g., "openai").

model

Character string. The specific LLM model name (e.g., "gpt-4-turbo").

Value

A data frame (tibble) summarizing the extraction attempt for each identifier, with columns:

identifier

The DOI or PMID.

status

"Success" (LLM ran, JSON saved), "Cached" (used existing cache), "Skipped" (e.g., missing paper path), or "Failure".

cache_file

Path to the saved/checked cache file if status is "Success" or "Cached".

error_message

The error message if status is "Failure".

Also prints progress and summary information.
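
Taken together, a typical batch run might look like the following sketch. The chat constructor, model name, and project path are assumptions to be adapted to your setup, and identifiers/paper_paths are assumed to be prepared as described above:

```r
library(ellmer)

# Assumed setup: an OpenAI-backed chat object and an existing metawoRld project
chat <- chat_openai(model = "gpt-4-turbo")

results <- df_extract_batch(
  chat           = chat,
  identifiers    = identifiers,
  paper_paths    = paper_paths,
  metawoRld_path = "~/projects/my_metawoRld",
  force_extract  = FALSE,
  stop_on_error  = FALSE
)

# Inspect any failures before importing the cached JSON with df_import_batch()
subset(results, status == "Failure")
```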