Data for researchers: Extracted text from more than 72,000 FDA medical device 510(k) summaries

Introduction

Every year, the FDA clears around 3,000 medical devices to enter the U.S. market through a review pathway known as the 510(k) program. Descriptions of the data and information that formed the basis for many of these decisions are available in 510(k) summaries, which are posted as PDFs on the FDA’s website after a decision is made. This makes it possible to find information about individual 510(k) clearance decisions, but analyzing the information across the 510(k) program over time has been more difficult because it is only made publicly available through tens of thousands of individual files – until now. Here, you’ll find a dataset with the full text contents of more than 85,000 510(k) clearance packages.¹ This includes full embedded and OCR text from over 560,000 pages.² It can be downloaded in either CSV or parquet format.

If you use this, please cite this page.

Note

The information in this dataset is from sources in the public domain. It is provided here “as-is” without warranty of any kind. For the most accurate and up-to-date information, always refer to the FDA website.

Dataset description

Note

When originally published, this dataset focused on 510(k) summaries. It now also includes clearance packages for many submissions where a 510(k) statement was used instead of a 510(k) summary. For these submissions, the clearance letter and indications for use are included in this dataset.

Most 510(k) clearance packages include:

A clearance letter from the FDA
An “indications for use” form
A 510(k) summary

For submissions where a “510(k) Statement” is used, only the clearance letter and the indications for use form are present.³

This dataset provides one row per page for each 510(k) clearance package and includes the following fields:

submission_number: The 510(k) or De Novo number for the submission associated with the 510(k) clearance package.
date_obtained: The date the 510(k) clearance package was obtained from the FDA website. The date is formatted according to ISO 8601.
page_number: The PDF page index from which the text was obtained.
text_embedded: The contents of any text embedded in the PDF. This is obtained using pdftools::pdf_text.
text_ocr: The contents of any text found using optical character recognition (OCR). This is from Tesseract via pdftools::pdf_ocr_text.

Download the data

If you just want to find a predicate device

For finding predicate devices, a site-specific search will probably serve you better than this dataset will. For example, if you are looking for tumor segmentation algorithms:

Google: site:accessdata.fda.gov/cdrh_docs/ "tumor segmentation"
DuckDuckGo: site:accessdata.fda.gov/cdrh_docs/ tumor segmentation
Bing: site:accessdata.fda.gov/cdrh_docs/ tumor segmentation

The dataset is available as a gzip-compressed CSV file and as a parquet file:

pmn_summary_text.csv.gz (Size: 272M, MD5: d1816db9bcb2c1eca24064edef0c1a5f)
pmn_summary_text.parquet (Size: 642M, MD5: ed0a9d1cf5b232bcc00ffcb52d636cfc)

Example of how to access and use the dataset with the R programming language

Here is a sample script in R that downloads and reads the dataset:

# Load libraries -----
library(fs)
library(readr)
library(utils)

# Set up URLs and paths -----
url_download <- 
  "https://www.boleary.com/blog/posts/202307-pmn/data/pmn_summary_text.csv.gz"
path_download <- fs::dir_create("data-raw")
filename_download <- "pmn_summary_text.csv.gz"
filepath_download <- 
  fs::path_expand(paste(path_download, filename_download, sep = "/"))

# Download the data -----
utils::download.file(
  url = url_download,
  destfile = filepath_download
)

# Read in the data -----
# Naming this "pmn_summaries" because "premarket notification (PMN)" is 
# another name for 510(k)s that's a bit easier to use in code. 
pmn_summaries <- 
  readr::read_delim(
    file = filepath_download,
    delim = ";",
    col_types = 
      readr::cols(
        submission_number = readr::col_character(),
        date_obtained = readr::col_date(),
        page_number = readr::col_integer(),
        text_embedded = readr::col_character(),
        text_ocr = readr::col_character()
      )
  )
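
Before relying on the downloaded file, it may be worth confirming that it matches the MD5 value listed in the download section. The following is a minimal sketch rather than part of the original script: `verify_md5()` is a hypothetical helper name, and it assumes the MD5 values on this page are current for the file you downloaded.

```r
# Hypothetical helper (not part of the original script): returns TRUE when the
# MD5 digest of the file at `path` matches `expected_md5`.
verify_md5 <- function(path, expected_md5) {
  # tools::md5sum() returns a named character vector, so drop the name.
  # It returns NA if the file cannot be read.
  actual_md5 <- unname(tools::md5sum(path))
  !is.na(actual_md5) && identical(tolower(actual_md5), tolower(expected_md5))
}

# Usage with the CSV download from the script above, using the MD5 value
# listed on this page:
# verify_md5(
#   path = filepath_download,
#   expected_md5 = "d1816db9bcb2c1eca24064edef0c1a5f"
# )
```

If this returns FALSE, the download may be incomplete, or the dataset may have been updated since the MD5 value was recorded.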

Here is a sample script in R that identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018:

# Load and install additional libraries -----
# Install the fdadata package from GitHub if it's missing
if (!require("fdadata")) {
  if (!require("devtools")) install.packages("devtools")
  devtools::install_github("bjoleary/fdadata")
}
library(dplyr)
library(lubridate)
library(stringr)
library(testthat)
library(tidyr)

# Load 510(k) submission metadata and filter to image processing devices -----
submissions_of_interest <- 
  fdadata::pmn |> 
  dplyr::filter(
    # Looking for submissions in product code LLZ for "System, Image 
    # Processing, Radiological"
    .data$product_code == "LLZ",
    # Looking for submissions with a decision date on or after 2008-01-01: 
    .data$date_decision >= lubridate::ymd("2008-01-01"),
    # Looking for submissions with a decision date before 2019-01-01: 
    .data$date_decision < lubridate::ymd("2019-01-01"),
  ) |> 
  # Just keep the submission_number field for this analysis
  dplyr::select("submission_number")

# Filter the pmn_summaries data by joining the submissions_of_interest -----
summaries_to_search <- 
  dplyr::inner_join(
    x = submissions_of_interest, 
    y = pmn_summaries, 
    by = c("submission_number" = "submission_number")
  )

# Set up a search term -----
submission_number_pattern <- 
  stringr::regex(
    # Match the letter "K" followed by exactly 6 numeric digits.
    # If, instead, you wanted to find both 510(k)s and De Novos, you might 
    # start with a pattern like this: "(K|DEN)[0-9]{6}"
    pattern = "K[0-9]{6}",
    # Accept either upper- or lower-case "K"s
    ignore_case = TRUE
  )

# Double check that the regular expression search term is behaving as expected
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "Can we find the submission number for K000000?",
      pattern = submission_number_pattern
    ),
  expected = list(c("K000000"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "What if we include a supplement number? K123456/S001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "And if it's a lower case K? k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("k180001"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "This time we will want to see both K123456 and k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456", "k180001"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "We don't expect it to match a q-submission number like Q123456",
      pattern = submission_number_pattern
    ),
  expected = list(character(0L))
)

# Search for submission numbers
search_results <- 
  summaries_to_search |> 
  # For each page, concatenate the embedded and OCR text so we can create one 
  # string where we have the best chance of finding submission numbers 
  # (we'll de-duplicate the numbers we find later)
  tidyr::unite(
    col = "combined_text",
    c("text_embedded", "text_ocr"),
    sep = " ",
    remove = TRUE,
    na.rm = TRUE
  ) |> 
  # Combine all clearance package pages from each submission into a single string
  dplyr::group_by(.data$submission_number) |> 
  dplyr::summarise(
    text = paste(.data$combined_text, collapse = "\n")
  ) |> 
  # Extract 510(k) submission numbers
  dplyr::mutate(
    submission_referenced =
      stringr::str_extract_all(
        string = .data$text,
        pattern = submission_number_pattern
      ) 
  ) |> 
  # Keep only submission number and results
  dplyr::select(
    "submission_number",
    "submission_referenced"
  ) |> 
  # Make 1 row for each reference found
  tidyr::unnest(cols = c(submission_referenced)) |> 
  # Make sure they are all upper case
  dplyr::mutate(
    submission_referenced = stringr::str_to_upper(.data$submission_referenced)
  ) |> 
  # Remove results where the reference found is the same as the submission 
  # it was found in
  dplyr::filter(.data$submission_number != .data$submission_referenced) |> 
  # Don't double count a reference just because it may have been mentioned 
  # more than once
  dplyr::distinct() |> 
  # Tally it up
  dplyr::group_by(.data$submission_referenced) |> 
  dplyr::tally(name = "references") |> 
  # Put in order of frequency of appearance followed by submission number, 
  # placing more recent submission numbers first
  dplyr::arrange(
    dplyr::desc(.data$references), 
    dplyr::desc(.data$submission_referenced)
  ) |> 
  # Limit to the first five rows
  utils::head(5) |> 
  # Join in some metadata
  dplyr::left_join(
    y = 
      fdadata::pmn |> 
      dplyr::select(
        "submission_number",
        "date_decision",
        "sponsor",
        "device"
      ),
    by = c("submission_referenced" = "submission_number")
  )

This produces Table 1.

Table 1: Five submissions frequently referenced in 510(k) summaries for image processing devices cleared from 2008 - 2018

	Submission Referenced	References	Date Decision	Sponsor	Device
1	K071331	16	2007-05-25	VITAL IMAGES, INC.	VITREA VERSION 4.0
2	K120361	12	2012-04-06	FUJIFILM MEDICAL SYSTEMS USA, INC.	SYNAPSE 3D BASE TOOLS
3	K073714	12	2008-03-19	ORTHOCRAT, LTD.	TRAUMACAD VERSION 2.0
4	K150843	11	2015-04-24	Siemens AG	syngo.via (version VB10A)
5	K110300	11	2011-07-01	MATERIALISE DENTAL NV	SIMPLANT 2011
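
A comment in the script suggests "(K|DEN)[0-9]{6}" as a starting point for matching De Novo numbers as well as 510(k) numbers. The sketch below tries that idea on a made-up sentence. It assumes De Novo numbers take the form "DEN" followed by six digits. The `\\b` word boundaries and the helper name `extract_submission_numbers()` are additions for illustration and are not in the original script.

```r
library(stringr)

# Assumption: De Novo numbers look like "DEN" followed by six digits.
# The \\b word boundaries (not in the original script) avoid matching a "K"
# number embedded in a longer token such as "BK123456".
submission_or_de_novo_pattern <-
  stringr::regex(
    pattern = "\\b(K|DEN)[0-9]{6}\\b",
    ignore_case = TRUE
  )

# Hypothetical helper: extract the unique, upper-cased submission numbers
# found in a single string.
extract_submission_numbers <- function(text) {
  found <- stringr::str_extract_all(text, submission_or_de_novo_pattern)[[1]]
  unique(stringr::str_to_upper(found))
}

# A made-up sentence for illustration only
extract_submission_numbers(
  "Predicates: K123456/S001, den180001, and k123456. Not Q123456 or BK123456."
)
```

Whether the word boundaries help or hurt may depend on OCR quality, since OCR text can run a submission number into neighboring characters.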

Additional considerations

PDF Portfolios are not included. A small number of 510(k) summaries are posted as PDF portfolios and may not have been processed correctly or included in this dataset. Based on manual spot-checks, I believe that problems are particularly common when a PDF portfolio includes a fillable version of the indications for use form.
Many 510(k) summaries do not include embedded text. Embedded text is not present in many of the 510(k) clearance packages, particularly for decisions made many years ago. Both embedded text, when available, and text from OCR should be included for each page in this dataset. Which you choose to use and when may depend on your specific use-case.
Many 510(k)s have a 510(k) statement instead of a summary. Not all cleared 510(k)s have 510(k) clearance packages on the FDA website. Some manufacturers use a 510(k) statement in lieu of a 510(k) summary, which means they promise to provide safety and effectiveness information within 30 days of a request from any person.⁴ In addition, the 510(k) Summary/Statement requirement did not exist until the 1990s, so earlier submissions do not have 510(k) clearance packages.⁵
510(k) summaries are not written by the FDA. A 510(k) summary is written by the manufacturer of the device, not by the FDA. Sometimes, the FDA provides considerable input. Other times, the FDA may conduct only a cursory review of a 510(k) summary. Practice has varied over the decades. Sometimes, the manufacturer and the FDA may forget to update the contents of a 510(k) summary at the end of a review after additional information has been provided, and a 510(k) summary may only reflect what was initially provided to the FDA before all questions were resolved. Be cautious about drawing firm conclusions about what was included in, or absent from, a 510(k) on the basis of a 510(k) summary.
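
For choosing between embedded and OCR text, one possible approach is to prefer the embedded text whenever it has any non-whitespace content and fall back to the OCR text otherwise. This is only a sketch of one option: `add_best_text()` and the `text_best` column are hypothetical names, and other use cases may be better served by the approach in the sample script, which concatenates both fields.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: add a "text_best" column that uses the embedded text
# when it contains at least one non-whitespace character and falls back to
# the OCR text otherwise.
add_best_text <- function(pages) {
  pages |>
    dplyr::mutate(
      text_best =
        dplyr::if_else(
          condition =
            !is.na(.data$text_embedded) &
            stringr::str_detect(.data$text_embedded, "\\S"),
          true = .data$text_embedded,
          false = .data$text_ocr
        )
    )
}

# Usage with the dataset read in by the script above:
# pmn_summaries_best <- add_best_text(pmn_summaries)
```
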

If you would like help mining this dataset or determining the best regulatory strategy for your product, I’m available as a consultant through NDA Partners or directly (message me on LinkedIn).

Known issues

Table 2: Known issues

	Submission Number	Issue	Date Checked	Status
1	K050151	Empty summary	2023-08-14	Not resolved
2	K222386	Wrong submission	2023-08-14	Not resolved
3	K221515	Wrong submission	2023-08-14	Not resolved
4	K211740	Wrong submission	2023-08-14	Not resolved
5	K202565	Wrong submission	2023-08-14	Not resolved
6	K190916	Wrong submission	2023-08-14	Not resolved
7	K190027	Wrong submission	2023-08-14	Not resolved
8	K170825	Wrong submission	2023-08-14	Not resolved
9	K162044	Wrong submission	2023-08-14	Not resolved
10	K900070	Not a 510(k) summary (Complete submission)	2023-08-14	Not resolved
11	K030515	Corrupt PDF	2023-08-14	Not resolved
12	K160695	Corrupt PDF	2023-08-14	Not resolved
13	K173946	Corrupt PDF	2023-08-14	Not resolved
14	K181029	Corrupt PDF	2023-08-14	Not resolved
15	K192198	Corrupt PDF	2023-08-14	Not resolved
16	K202408	Corrupt PDF	2023-08-14	Not resolved
17	K210112	Corrupt PDF	2023-08-14	Not resolved
18	K221619	Corrupt PDF	2023-08-14	Not resolved
19	K210801	Not a 510(k) summary (Decision summary)	2023-08-14	Not resolved
20	K993307	Missing pages	2023-08-14	Not resolved
21	K220672	Empty summary	2023-08-14	Not resolved

Thanks to Jake W. for identifying many of these.

Changelog

2024-07-08:
- Added additional 510(k) clearance packages.
- Changed parquet compression from gzip to snappy.
2024-06-27:
- Added additional 510(k) clearance packages.
2024-06-05:
- Added clearance packages for 510(k)s with 510(k) Statements instead of summaries. These include the clearance letter and the indications for use and do not include a 510(k) summary.
2024-05-10:
- Added additional 510(k) summaries.
2024-02-29:
- Added additional 510(k) summaries.
2023-12-23:
- Added additional 510(k) summaries.
2023-10-25:
- Added additional 510(k) summaries.
2023-08-14:
- Fixed an error in the sample script in R that identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. After submission numbers are extracted, they are now all made upper case using stringr::str_to_upper() before they are counted. Before, for example, “K100001” and “k100001” would have been counted as different submissions because of the difference in case for the “K”. This fix did not change the results presented in Table 1.
- Fixed a spelling mistake in a footnote.
- Added Known issues section.
- Clarified that De Novo reclassification orders are treated as 510(k) summaries for the purposes of this dataset.
- Added new submissions recently posted to the FDA website.
- Added old submissions that were previously missing from the dataset. I believe the dataset is now comprehensive as of 2023-08-14.
2023-07-17: Initial publication.

Footnotes

1. Upon initial publication, this dataset included more than 72,000 510(k) summary packages. This includes De Novos, where the reclassification order was used as the summary. Earlier versions of this dataset did not include clearance packages for 510(k)s where a 510(k) Statement was used.
2. Upon initial publication, this dataset included more than 494,000 pages.
3. See: https://www.fda.gov/medical-devices/premarket-notification-510k/content-510k#link_7.
4. See the FDA’s description of the necessary Content of a 510(k), which describes this in more depth.
5. The requirement for a 510(k) summary or a 510(k) statement is from the Safe Medical Devices Act (SMDA) of 1990. The regulation, 21 CFR 807.92, was established through an interim rule with 57 FR 18066 on April 28, 1992 and was finalized with 59 FR 64295 on December 14, 1994.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{o'leary2024,
  author = {O’Leary, Brendan},
  title = {Data for Researchers: {Extracted} Text from More Than 72,000
    {FDA} Medical Device 510(k) Summaries},
  date = {2024-06-05},
  url = {https://www.boleary.com/blog/posts/202307-pmn/},
  langid = {en}
}

For attribution, please cite this work as:

B. O’Leary, “Data for researchers: Extracted text from more than 72,000 FDA medical device 510(k) summaries,” Jun. 05, 2024. https://www.boleary.com/blog/posts/202307-pmn/