Data for researchers: Extracted text from more than 72,000 FDA medical device 510(k) summaries

510(k)
datasets
FDA
medical devices
R
regulatory science
Author: Brendan O’Leary
Published: 2023-07-17
Modified: 2024-02-29

Introduction

Every year, the FDA clears around 3,000 medical devices to enter the U.S. market through a review pathway known as the 510(k) program. Descriptions of the data and information that formed the basis for many of these decisions are available in 510(k) summaries, which are posted as PDFs on the FDA’s website after a decision is made. This makes it possible to find information about individual 510(k) clearance decisions, but analyzing information across the 510(k) program over time has been more difficult because it is only publicly available as tens of thousands of individual files – until now. Here, you’ll find a dataset with the full text contents of more than 73,000 510(k) summary packages.1 This includes full embedded and OCR text from over 511,000 pages.2 It can be downloaded in either CSV or parquet format.

If you use this, please cite this page.

Note

The information in this dataset is from sources in the public domain. It is provided here “as-is” without warranty of any kind. For the most accurate and up-to-date information, always refer to the FDA website.

Dataset description

Most 510(k) summary packages include:

  • A clearance letter from the FDA

  • An “indications for use” form

  • A 510(k) summary

This dataset provides one row per page for each 510(k) summary package and includes the following fields:

  • submission_number: The 510(k) or De Novo number for the submission associated with the 510(k) summary package.

  • date_obtained: The date the 510(k) summary package was obtained from the FDA website. The date is formatted according to ISO 8601.

  • page_number: The PDF page index from which the text was obtained.

  • text_embedded: The contents of any text embedded in the PDF. This is obtained using pdftools::pdf_text.

  • text_ocr: The contents of any text found using optical character recognition (OCR). This is from Tesseract via pdftools::pdf_ocr_text.
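
To make the one-row-per-page layout concrete, here is a minimal sketch in base R. It builds a toy data frame with the same five columns (all values are made up, and starting page_number at 1 is an assumption), then checks that no page appears twice and counts pages per summary package.

```r
# Toy stand-in for the real dataset: same five columns, made-up values.
# Assumption: page_number starts at 1; check the real data before relying on it.
pmn_toy <- data.frame(
  submission_number = c("K000001", "K000001", "K000002"),
  date_obtained = as.Date(c("2023-07-01", "2023-07-01", "2023-07-02")),
  page_number = c(1L, 2L, 1L),
  text_embedded = c("Page one text", "", "Another summary"),
  text_ocr = c("Page one text", "Scanned page two", "Another summary"),
  stringsAsFactors = FALSE
)

# One row per page: no (submission_number, page_number) pair should repeat.
page_keys <- pmn_toy[, c("submission_number", "page_number")]
has_duplicate_pages <- anyDuplicated(page_keys) > 0

# Count the pages in each summary package (a table of counts by submission).
pages_per_package <- table(pmn_toy$submission_number)
print(pages_per_package)
```

The same two checks should carry over to the full dataset once it is loaded, by swapping pmn_toy for the real data frame.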

Download the data

For finding predicate devices, a site-specific search will probably serve you better than this dataset will. For example, if you are looking for tumor segmentation algorithms:

If you would like help mining this dataset or determining the best regulatory strategy for your product, I’m available as a consultant through NDA Partners or directly (message me on LinkedIn).

The dataset is available as a gzip-compressed CSV file and as a parquet file. The sample R script in the next section downloads the CSV version.

Example of how to access and use the dataset with the R programming language

Here is a sample script in R that downloads and reads the dataset:

# Load libraries -----
library(fs)
library(readr)
library(utils)

# Set up URLs and paths -----
url_download <- 
  "https://www.boleary.com/blog/posts/202307-pmn/data/pmn_summary_text.csv.gz"
path_download <- fs::dir_create("data-raw")
filename_download <- "pmn_summary_text.csv.gz"
filepath_download <- 
  fs::path_expand(paste(path_download, filename_download, sep = "/"))

# Download the data -----
utils::download.file(
  url = url_download,
  destfile = filepath_download
)

# Read in the data -----
# Naming this "pmn_summaries" because "premarket notification (PMN)" is 
# another name for 510(k)s that's a bit easier to use in code. 
pmn_summaries <- 
  readr::read_delim(
    file = filepath_download,
    delim = ";",
    col_types = 
      readr::cols(
        submission_number = readr::col_character(),
        date_obtained = readr::col_date(),
        page_number = readr::col_integer(),
        text_embedded = readr::col_character(),
        text_ocr = readr::col_character()
      )
  )
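
If you would rather work from the parquet file, something like the following sketch may help. It assumes the arrow package is installed and that you have already saved the parquet file locally. The helper name read_pmn_parquet and the file name in the usage comment are hypothetical, not part of the original script.

```r
# Hypothetical helper: read a locally saved copy of the parquet file.
# Assumes the arrow package is installed; fails with a clear message otherwise.
read_pmn_parquet <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (!requireNamespace("arrow", quietly = TRUE)) {
    stop("The arrow package is required to read parquet files.")
  }
  arrow::read_parquet(path)
}

# Usage (the path is an assumption; point it at wherever you saved the file):
# pmn_summaries <- read_pmn_parquet("data-raw/pmn_summary_text.parquet")
```

Parquet stores column types with the data, so the explicit col_types specification used for the CSV should not be needed here.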

The following sample script in R identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018.

# Load and install additional libraries -----
# Install the fdadata package from GitHub if it's missing
if (!require("fdadata")) {
  if (!require("devtools")) install.packages("devtools")
  devtools::install_github("bjoleary/fdadata")
}
library(dplyr)
library(lubridate)
library(stringr)
library(testthat)
library(tidyr)

# Load 510(k) submission metadata and filter to image processing devices -----
submissions_of_interest <- 
  fdadata::pmn |> 
  dplyr::filter(
    # Looking for submissions in product code LLZ for "System, Image 
    # Processing, Radiological"
    .data$product_code == "LLZ",
    # Looking for submissions with a decision date on or after 2008-01-01: 
    .data$date_decision >= lubridate::ymd("2008-01-01"),
    # Looking for submissions with a decision date before 2019-01-01: 
    .data$date_decision < lubridate::ymd("2019-01-01"),
  ) |> 
  # Just keep the submission_number field for this analysis
  dplyr::select("submission_number")

# Filter the pmn_summaries data by joining the submissions_of_interest -----
summaries_to_search <- 
  dplyr::inner_join(
    x = submissions_of_interest, 
    y = pmn_summaries, 
    by = c("submission_number" = "submission_number")
  )

# Set up a search term -----
submission_number_pattern <- 
  stringr::regex(
    # Match the letter "K" followed by exactly 6 numeric digits
    pattern = "K[0-9]{6}",
    # If, instead, you wanted to find both 510(k)s and De Novos, you might 
    # start with a pattern like this: "(K|DEN)[0-9]{6}"
    # Accept either upper- or lower-case "K"s
    ignore_case = TRUE
  )

# Double check that the regular expression search term is behaving as expected
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "Can we find the submission number for K000000?",
      pattern = submission_number_pattern
    ),
  expected = list(c("K000000"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "What if we include a supplement number? K123456/S001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "And if it's a lower case K? k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("k180001"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "This time we will want to see both K123456 and k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456", "k180001"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "We don't expect it to match a q-submission number like Q123456",
      pattern = submission_number_pattern
    ),
  expected = list(character(0L))
)

# Search for submission numbers
search_results <- 
  summaries_to_search |> 
  # For each page, concatenate the embedded and OCR text so we can create one 
  # string where we have the best chance of finding submission numbers 
  # (we'll de-duplicate the numbers we find later)
  tidyr::unite(
    col = "combined_text",
    c("text_embedded", "text_ocr"),
    sep = " ",
    remove = TRUE,
    na.rm = TRUE
  ) |> 
  # Combine all summary package pages from each submission into a single string
  dplyr::group_by(.data$submission_number) |> 
  dplyr::summarise(
    text = paste(.data$combined_text, collapse = "\n")
  ) |> 
  # Extract 510(k) submission numbers
  dplyr::mutate(
    submission_referenced =
      stringr::str_extract_all(
        string = .data$text,
        pattern = submission_number_pattern
      ) 
  ) |> 
  # Keep only submission number and results
  dplyr::select(
    "submission_number",
    "submission_referenced"
  ) |> 
  # Make 1 row for each reference found
  tidyr::unnest(cols = c(submission_referenced)) |> 
  # Make sure they are all upper case
  dplyr::mutate(
    submission_referenced = stringr::str_to_upper(.data$submission_referenced)
  ) |> 
  # Remove results where the reference found is the same as the submission 
  # it was found in
  dplyr::filter(.data$submission_number != .data$submission_referenced) |> 
  # Don't double count a reference just because it may have been mentioned 
  # more than once
  dplyr::distinct() |> 
  # Tally it up
  dplyr::group_by(.data$submission_referenced) |> 
  dplyr::tally(name = "references") |> 
  # Put in order of frequency of appearance followed by submission number, 
  # placing more recent submission numbers first
  dplyr::arrange(
    dplyr::desc(.data$references), 
    dplyr::desc(.data$submission_referenced)
  ) |> 
  # Limit to the first five rows
  utils::head(5) |> 
  # Join in some metadata
  dplyr::left_join(
    y = 
      fdadata::pmn |> 
      dplyr::select(
        "submission_number",
        "date_decision",
        "sponsor",
        "device"
      ),
    by = c("submission_referenced" = "submission_number")
  ) 

This produces Table 1.

Table 1: Five submissions frequently referenced in 510(k) summaries for image processing devices cleared from 2008 - 2018

| # | Submission Referenced | References | Date Decision | Sponsor | Device |
|---|---|---|---|---|---|
| 1 | K071331 | 16 | 2007-05-25 | VITAL IMAGES, INC. | VITREA VERSION 4.0 |
| 2 | K120361 | 12 | 2012-04-06 | FUJIFILM MEDICAL SYSTEMS USA, INC. | SYNAPSE 3D BASE TOOLS |
| 3 | K073714 | 12 | 2008-03-19 | ORTHOCRAT, LTD. | TRAUMACAD VERSION 2.0 |
| 4 | K150843 | 11 | 2015-04-24 | Siemens AG | syngo.via (version VB10A) |
| 5 | K110300 | 11 | 2011-07-01 | MATERIALISE DENTAL NV | SIMPLANT 2011 |
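
A comment in the script above suggests "(K|DEN)[0-9]{6}" as a starting pattern if you want De Novos as well as 510(k)s. Here is a small sketch of how that pattern behaves, written in base R so it runs without extra packages. The helper name extract_submission_numbers and the example sentence are made up for illustration.

```r
# Hypothetical helper: pull 510(k) and De Novo numbers out of a string,
# ignoring case, then upper-case and de-duplicate the results.
extract_submission_numbers <- function(text, pattern = "(K|DEN)[0-9]{6}") {
  matches <- regmatches(text, gregexpr(pattern, text, ignore.case = TRUE))
  unique(toupper(unlist(matches)))
}

# Made-up example text containing two 510(k)s, one De Novo, and a Q-submission.
example_text <-
  "Predicates: K123456 and k180001. Also see DEN170001, not Q123456."

found <- extract_submission_numbers(example_text)
print(found)
```

Neither this pattern nor the one in the script anchors on word boundaries, so a longer string such as "K1234567" would still yield "K123456". Depending on your use-case, it may be worth tightening the pattern and spot-checking the matches.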

Additional considerations

  • PDF Portfolios are not included. A small number of 510(k) summaries are posted as PDF portfolios and may not have been processed correctly or included in this dataset. Based on manual spot-checks, I believe that problems are particularly common when a PDF portfolio includes a fillable version of the indications for use form.

  • Many 510(k) summaries do not include embedded text. Embedded text is not present in many of the 510(k) summary packages, particularly for decisions made many years ago. Both embedded text, when available, and text from OCR should be included for each page in this dataset. Which you choose to use and when may depend on your specific use-case.

  • Many 510(k)s have a 510(k) statement instead of a summary. Not all cleared 510(k)s have 510(k) summary packages on the FDA website. Some manufacturers use a 510(k) statement in lieu of a 510(k) summary, which means they promise to provide safety and effectiveness information within 30 days of a request from any person.3 In addition, the 510(k) Summary/Statement requirement did not exist until the 1990s, so earlier submissions do not have 510(k) summary packages.4

  • 510(k) summaries are not written by the FDA. A 510(k) summary is written by the manufacturer of the device, not by the FDA. Sometimes, the FDA provides considerable input. Other times, the FDA may conduct only a cursory review of a 510(k) summary. Practice has varied over the decades. Sometimes, the manufacturer and the FDA may forget to update the contents of a 510(k) summary at the end of a review after additional information has been provided, so a 510(k) summary may only reflect what was initially provided to the FDA before all questions were resolved. Be cautious about drawing firm conclusions about what was included in, or absent from, a 510(k) on the basis of a 510(k) summary.
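
Because embedded text is missing for many pages, you may want a single text column that falls back to OCR. The sketch below shows one possible policy in base R: prefer embedded text when it is present and non-blank, otherwise use the OCR text. The helper name pick_page_text and the toy values are assumptions for illustration. Other policies, such as concatenating both as the script above does, may suit your use-case better.

```r
# Hypothetical helper: prefer embedded text, fall back to OCR text when the
# embedded text is NA, empty, or only whitespace.
pick_page_text <- function(text_embedded, text_ocr) {
  has_embedded <- !is.na(text_embedded) & nzchar(trimws(text_embedded))
  ifelse(has_embedded, text_embedded, text_ocr)
}

# Toy pages covering the four cases: present, empty, NA, and whitespace only.
pages <- data.frame(
  text_embedded = c("Embedded text", "", NA, "   "),
  text_ocr = c("Ernbedded text", "OCR only", "OCR for NA", "OCR for blank"),
  stringsAsFactors = FALSE
)
pages$text_best <- pick_page_text(pages$text_embedded, pages$text_ocr)
print(pages$text_best)
```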


Known issues

Table 2: Known issues

| # | Submission Number | Issue | Date Checked | Status |
|---|---|---|---|---|
| 1 | K050151 | Empty summary | 2023-08-14 | Not resolved |
| 2 | K222386 | Wrong submission | 2023-08-14 | Not resolved |
| 3 | K221515 | Wrong submission | 2023-08-14 | Not resolved |
| 4 | K211740 | Wrong submission | 2023-08-14 | Not resolved |
| 5 | K202565 | Wrong submission | 2023-08-14 | Not resolved |
| 6 | K190916 | Wrong submission | 2023-08-14 | Not resolved |
| 7 | K190027 | Wrong submission | 2023-08-14 | Not resolved |
| 8 | K170825 | Wrong submission | 2023-08-14 | Not resolved |
| 9 | K162044 | Wrong submission | 2023-08-14 | Not resolved |
| 10 | K900070 | Not a 510(k) summary (Complete submission) | 2023-08-14 | Not resolved |
| 11 | K030515 | Corrupt PDF | 2023-08-14 | Not resolved |
| 12 | K160695 | Corrupt PDF | 2023-08-14 | Not resolved |
| 13 | K173946 | Corrupt PDF | 2023-08-14 | Not resolved |
| 14 | K181029 | Corrupt PDF | 2023-08-14 | Not resolved |
| 15 | K192198 | Corrupt PDF | 2023-08-14 | Not resolved |
| 16 | K202408 | Corrupt PDF | 2023-08-14 | Not resolved |
| 17 | K210112 | Corrupt PDF | 2023-08-14 | Not resolved |
| 18 | K221619 | Corrupt PDF | 2023-08-14 | Not resolved |
| 19 | K210801 | Not a 510(k) summary (Decision summary) | 2023-08-14 | Not resolved |
| 20 | K993307 | Missing pages | 2023-08-14 | Not resolved |
| 21 | K220672 | Empty summary | 2023-08-14 | Not resolved |

Thanks to Jake W. for identifying many of these.
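
One way to act on Table 2 is to drop the affected submissions before analysis. The sketch below does that in base R on a toy data frame. The submission numbers are copied from Table 2, while the helper name drop_known_issues and the toy rows are made up for illustration. Whether to exclude all of them is a judgment call that depends on your analysis.

```r
# Submission numbers listed in Table 2 (as checked on 2023-08-14).
known_issue_numbers <- c(
  "K050151", "K222386", "K221515", "K211740", "K202565", "K190916",
  "K190027", "K170825", "K162044", "K900070", "K030515", "K160695",
  "K173946", "K181029", "K192198", "K202408", "K210112", "K221619",
  "K210801", "K993307", "K220672"
)

# Hypothetical helper: keep only rows whose submission is not on the list.
drop_known_issues <- function(data, issues = known_issue_numbers) {
  data[!(toupper(data$submission_number) %in% issues), , drop = FALSE]
}

# Toy rows: one page from a known-issue submission, two from a made-up one.
toy_pages <- data.frame(
  submission_number = c("K050151", "K123456", "K123456"),
  page_number = c(1L, 1L, 2L),
  stringsAsFactors = FALSE
)
clean_pages <- drop_known_issues(toy_pages)
print(clean_pages)
```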

Changelog

  • 2024-02-29:
    • Added additional 510(k) summaries.
  • 2023-12-23:
    • Added additional 510(k) summaries.
  • 2023-10-25:
    • Added additional 510(k) summaries.
  • 2023-08-14:
    • Fixed an error in the sample script in R that identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. After submission numbers are extracted, they are now all made upper case using stringr::str_to_upper() before they are counted. Before, for example, “K100001” and “k100001” would have been counted as different submissions because of the difference in case for the “K”. This fix did not change the results presented in Table 1.
    • Fixed a spelling mistake in a footnote.
    • Added Known issues section.
    • Clarified that De Novo reclassification orders are treated as 510(k) summaries for the purposes of this dataset.
    • Added new submissions recently posted to the FDA website.
    • Added old submissions that were previously missing from the dataset. I believe the dataset is now comprehensive as of 2023-08-14.
  • 2023-07-17: Initial publication.

Footnotes

  1. Upon initial publication, this dataset included more than 72,000 510(k) summary packages. This includes De Novos, where the reclassification order is used as the summary.

  2. Upon initial publication, this dataset included more than 494,000 pages.

  3. See the FDA’s description of the necessary Content of a 510(k), which describes this in more depth.

  4. The requirement for a 510(k) summary or a 510(k) statement is from the Safe Medical Devices Act (SMDA) of 1990. The regulation, 21 CFR 807.92, was established through an interim rule with 57 FR 18066 on April 28, 1992 and was finalized with 59 FR 64295 on December 14, 1994.

Citation

BibTeX citation:
@online{o'leary2024,
  author = {O’Leary, Brendan},
  title = {Data for Researchers: {Extracted} Text from More Than 72,000
    {FDA} Medical Device 510(k) Summaries},
  date = {2024-02-29},
  url = {https://www.boleary.com/blog/posts/202307-pmn/},
  langid = {en}
}
For attribution, please cite this work as:
B. O’Leary, “Data for researchers: Extracted text from more than 72,000 FDA medical device 510(k) summaries,” Feb. 29, 2024. https://www.boleary.com/blog/posts/202307-pmn/