# Load libraries -----
library(fs)
library(readr)
library(utils)
# Set up URLs and paths -----
url_download <-
  "https://www.boleary.com/blog/posts/202307-pmn/data/pmn_summary_text.csv.gz"
path_download <- fs::dir_create("data-raw")
filename_download <- "pmn_summary_text.csv.gz"
filepath_download <-
  fs::path_expand(paste(path_download, filename_download, sep = "/"))
# Download the data -----
utils::download.file(
  url = url_download,
  destfile = filepath_download
)
# Read in the data -----
# Naming this "pmn_summaries" because "premarket notification (PMN)" is
# another name for 510(k)s that's a bit easier to use in code.
pmn_summaries <-
  readr::read_delim(
    file = filepath_download,
    delim = ";",
    col_types =
      readr::cols(
        submission_number = readr::col_character(),
        date_obtained = readr::col_date(),
        page_number = readr::col_integer(),
        text_embedded = readr::col_character(),
        text_ocr = readr::col_character()
      )
  )
Introduction
Every year, the FDA clears around 3,000 medical devices to enter the U.S. market through a review pathway known as the 510(k) program. Descriptions of the data and information that formed the basis for many of these decisions are available in 510(k) summaries, which are posted as PDFs on the FDA’s website after a decision is made. This makes it possible to find information about individual 510(k) clearance decisions, but analyzing the information across the 510(k) program over time has been more difficult because it is only made publicly available through tens of thousands of individual files – until now.
Here, you’ll find a dataset with the full text contents of more than 73,000 510(k) summary packages.[1] This includes full embedded and OCR text from over 511,000 pages.[2] It can be downloaded in either CSV or parquet format.
If you use this, please cite this page.
The information in this dataset is from sources in the public domain. It is provided here “as-is” without warranty of any kind. For the most accurate and up-to-date information, always refer to the FDA website.
Dataset description
Most 510(k) summary packages include:
- A clearance letter from the FDA
- An “indications for use” form
- A 510(k) summary
This dataset provides one row per page for each 510(k) summary package and includes the following fields:
submission_number
- The 510(k) or De Novo number for the submission associated with the 510(k) summary package.
date_obtained
- The date the 510(k) summary package was obtained from the FDA website. The date is formatted according to ISO 8601.
page_number
- The PDF page index from which the text was obtained.
text_embedded
- The contents of any text embedded in the PDF. This is obtained using pdftools::pdf_text.
text_ocr
- The contents of any text found using optical character recognition (OCR). This is from Tesseract via pdftools::pdf_ocr_text.
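To make the one-row-per-page layout concrete, here is a minimal sketch in base R. The submission numbers and page text are made up for illustration; only the column names come from the field list above. It puts the pages in reading order and then pastes the OCR text of each submission into a single string.

```r
# Toy data in the one-row-per-page layout described above. The submission
# numbers and text are invented for illustration only.
pages <- data.frame(
  submission_number = c("K000001", "K000001", "K000002"),
  date_obtained = as.Date(c("2023-07-01", "2023-07-01", "2023-07-02")),
  page_number = c(2L, 1L, 1L),
  text_embedded = c("second page", "first page", NA),
  text_ocr = c("second page", "first page", "scanned page"),
  stringsAsFactors = FALSE
)

# Sort so that pages are in reading order within each submission
pages <- pages[order(pages$submission_number, pages$page_number), ]

# Collapse the OCR text of each submission into one string, giving a
# character vector named by submission number
documents <- vapply(
  split(pages$text_ocr, pages$submission_number),
  paste,
  character(1),
  collapse = "\n"
)
```

Sorting by page_number first matters because nothing in the field descriptions promises that rows arrive in page order.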
Download the data
For finding predicate devices, a site-specific search will probably serve you better than this dataset will. For example, if you are looking for tumor segmentation algorithms:
- Google: site:accessdata.fda.gov/cdrh_docs/ "tumor segmentation"
- DuckDuckGo: site:accessdata.fda.gov/cdrh_docs/ tumor segmentation
If you would like help mining this dataset or determining the best regulatory strategy for your product, I’m available as a consultant through NDA Partners or directly (message me on LinkedIn).
The dataset is available as a gzip-compressed CSV file and as a parquet file:
- pmn_summary_text.csv.gz (237 MB)
- pmn_summary_text.parquet (321 MB)
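If you would rather not depend on readr, the gzip-compressed CSV can probably be read with base R as well. The sketch below is only an illustration: it writes a tiny stand-in file to a temporary path, using the semicolon delimiter and column names from the sample script on this page, and reads it back. The rows are invented, and treating empty fields as NA is an assumption about how you may want missing text represented. For the real data, point file_path at the downloaded pmn_summary_text.csv.gz instead.

```r
# Write a tiny stand-in for pmn_summary_text.csv.gz to a temporary file.
# The rows are invented; only the column names and the ";" delimiter follow
# the sample script.
file_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(file_path, open = "wt")
writeLines(
  c(
    "submission_number;date_obtained;page_number;text_embedded;text_ocr",
    "K000001;2023-07-01;1;embedded text;ocr text",
    "K000001;2023-07-01;2;;more ocr text"
  ),
  con
)
close(con)

# Read the compressed file back, declaring the type of each column
pmn_pages <- utils::read.delim(
  file = gzfile(file_path),
  sep = ";",
  colClasses = c(
    submission_number = "character",
    date_obtained = "Date",
    page_number = "integer",
    text_embedded = "character",
    text_ocr = "character"
  ),
  # Assumption: treat empty fields as missing values
  na.strings = c("", "NA"),
  stringsAsFactors = FALSE
)
```

Base R is likely to be noticeably slower than readr on a file of this size, so this is mainly useful where installing extra packages is not an option.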
Example of how to access and use the dataset with the R programming language
The sample script at the top of this page downloads the dataset and reads it into R as pmn_summaries.
The following sample script in R builds on it to identify the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018.
# Load and install additional libraries -----
# Install the fdadata package from GitHub if it's missing
if (!require("fdadata")) {
  if (!require("devtools")) install.packages("devtools")
  devtools::install_github("bjoleary/fdadata")
}
library(dplyr)
library(lubridate)
library(stringr)
library(testthat)
library(tidyr)
# Load 510(k) submission metadata and filter to image processing devices -----
submissions_of_interest <-
  fdadata::pmn |>
  dplyr::filter(
    # Looking for submissions in product code LLZ for "System, Image
    # Processing, Radiological"
    .data$product_code == "LLZ",
    # Looking for submissions with a decision date on or after 2008-01-01:
    .data$date_decision >= lubridate::ymd("2008-01-01"),
    # Looking for submissions with a decision date before 2019-01-01:
    .data$date_decision < lubridate::ymd("2019-01-01")
  ) |>
  # Just keep the submission_number field for this analysis
  dplyr::select("submission_number")
# Filter the pmn_summaries data by joining the submissions_of_interest -----
summaries_to_search <-
  dplyr::inner_join(
    x = submissions_of_interest,
    y = pmn_summaries,
    by = c("submission_number" = "submission_number")
  )
# Set up a search term -----
submission_number_pattern <-
  stringr::regex(
    # Match the letter "K" followed by exactly 6 numeric digits
    pattern = "K[0-9]{6}",
    # If, instead, you wanted to find both 510(k)s and De Novos, you might
    # start with a pattern like this: "(K|DEN)[0-9]{6}"
    # Accept either upper- or lower-case "K"s
    ignore_case = TRUE
  )
# Double check that the regular expression search term is behaving as expected
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "Can we find the submission number for K000000?",
      pattern = submission_number_pattern
    ),
  expected = list(c("K000000"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "What if we include a supplement number? K123456/S001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "And if it's a lower case K? k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("k180001"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "This time we will want to see both K123456 and k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456", "k180001"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "We don't expect it to match a q-submission number like Q123456",
      pattern = submission_number_pattern
    ),
  expected = list(character(0L))
)
# Search for submission numbers
search_results <-
  summaries_to_search |>
  # For each page, concatenate the embedded and OCR text so we can create one
  # string where we have the best chance of finding submission numbers
  # (we'll de-duplicate the numbers we find later)
  tidyr::unite(
    col = "combined_text",
    c("text_embedded", "text_ocr"),
    sep = " ",
    remove = TRUE,
    na.rm = TRUE
  ) |>
  # Combine all summary package pages from each submission into a single string
  dplyr::group_by(.data$submission_number) |>
  dplyr::summarise(
    text = paste(.data$combined_text, collapse = "\\n")
  ) |>
  # Extract 510(k) submission numbers
  dplyr::mutate(
    submission_referenced =
      stringr::str_extract_all(
        string = .data$text,
        pattern = submission_number_pattern
      )
  ) |>
  # Keep only submission number and results
  dplyr::select(
    "submission_number",
    "submission_referenced"
  ) |>
  # Make 1 row for each reference found
  tidyr::unnest(cols = c(submission_referenced)) |>
  # Make sure they are all upper case
  dplyr::mutate(
    submission_referenced = stringr::str_to_upper(.data$submission_referenced)
  ) |>
  # Remove results where the reference found is the same as the submission
  # it was found in
  dplyr::filter(.data$submission_number != .data$submission_referenced) |>
  # Don't double count a reference just because it may have been mentioned
  # more than once
  dplyr::distinct() |>
  # Tally it up
  dplyr::group_by(.data$submission_referenced) |>
  dplyr::tally(name = "references") |>
  # Put in order of frequency of appearance followed by submission number,
  # placing more recent submission numbers first
  dplyr::arrange(
    dplyr::desc(.data$references),
    dplyr::desc(.data$submission_referenced)
  ) |>
  # Limit to the first five rows
  utils::head(5) |>
  # Join in some metadata
  dplyr::left_join(
    y =
      fdadata::pmn |>
      dplyr::select(
        "submission_number",
        "date_decision",
        "sponsor",
        "device"
      ),
    by = c("submission_referenced" = "submission_number")
  )
This produces Table 1.
# | Submission Referenced | References | Date Decision | Sponsor | Device |
---|---|---|---|---|---|
1 | K071331 | 16 | 2007-05-25 | VITAL IMAGES, INC. | VITREA VERSION 4.0 |
2 | K120361 | 12 | 2012-04-06 | FUJIFILM MEDICAL SYSTEMS USA, INC. | SYNAPSE 3D BASE TOOLS |
3 | K073714 | 12 | 2008-03-19 | ORTHOCRAT, LTD. | TRAUMACAD VERSION 2.0 |
4 | K150843 | 11 | 2015-04-24 | Siemens AG | syngo.via (version VB10A) |
5 | K110300 | 11 | 2011-07-01 | MATERIALISE DENTAL NV | SIMPLANT 2011 |
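The comments in the script mention a broader pattern, "(K|DEN)[0-9]{6}", for catching De Novo numbers as well. As a rough sketch of how that variant might behave, here is a similar check in base R. The extract_submission_numbers() helper and the sample sentence are hypothetical, and the \\b word boundaries are an addition that is not in the original script; they keep the pattern from matching the first six digits of a longer run of digits.

```r
# Hypothetical helper: find 510(k) and De Novo numbers in each string,
# convert them to upper case, and drop repeats
extract_submission_numbers <- function(text) {
  matches <- regmatches(
    text,
    gregexpr(
      pattern = "\\b(K|DEN)[0-9]{6}\\b",
      text = text,
      ignore.case = TRUE,
      perl = TRUE
    )
  )
  lapply(matches, function(found) unique(toupper(found)))
}

# Made-up sentence with a repeated number, a De Novo number, a number with
# too many digits, and a Q-submission number
found <- extract_submission_numbers(
  "Predicates: k000001 and DEN000002. See also K000001, not K0000001 or Q000003."
)
```

Here found should hold only K000001 and DEN000002. Whether the stricter word-boundary matching is desirable for OCR text, where stray characters can end up glued to a number, is a judgment call.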
Additional considerations
PDF Portfolios are not included. A small number of 510(k) summaries are posted as PDF portfolios and may not have been processed correctly or included in this dataset. Based on manual spot-checks, I believe that problems are particularly common when a PDF portfolio includes a fillable version of the indications for use form.
Many 510(k) summaries do not include embedded text. Embedded text is missing from many 510(k) summary packages, particularly those for decisions made many years ago. For each page, this dataset should include both the embedded text, when available, and the OCR text. Which one you use, and when, may depend on your specific use case.
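One possible way to handle that choice, sketched here in base R, is to prefer the embedded text and fall back to the OCR text page by page. This assumes that missing embedded text shows up as NA or as an empty or whitespace-only string, which may not hold for every page. The choose_text() helper is hypothetical and the sample values are made up.

```r
# Hypothetical helper: use the embedded text where it has content, and the
# OCR text otherwise
choose_text <- function(text_embedded, text_ocr) {
  has_embedded <- !is.na(text_embedded) & nzchar(trimws(text_embedded))
  ifelse(has_embedded, text_embedded, text_ocr)
}

# Made-up pages: one with embedded text, one without, one with only spaces
chosen <- choose_text(
  text_embedded = c("Embedded text", NA, "   "),
  text_ocr = c("Embedded text (OCR)", "Scanned page", "Another scan")
)
```

The sample script on this page takes a different route and concatenates both fields before searching, which gives the best chance of finding a submission number at the cost of duplicated text.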
Many 510(k)s have a 510(k) statement instead of a summary. Not all cleared 510(k)s have 510(k) summary packages on the FDA website. Some manufacturers use a 510(k) statement in lieu of a 510(k) summary, which means they promise to provide safety and effectiveness information within 30 days of a request from any person.[3] In addition, the 510(k) Summary/Statement requirement did not exist until the 1990s, so earlier submissions do not have 510(k) summary packages.[4]
510(k) summaries are not written by the FDA. A 510(k) summary is written by the manufacturer of the device, not by the FDA. Sometimes, the FDA provides considerable input. Other times, the FDA may conduct only a cursory review of a 510(k) summary. Practice has varied over the decades. Sometimes, the manufacturer and the FDA may forget to update the contents of a 510(k) summary at the end of a review after additional information has been provided, and a 510(k) summary may only reflect what was initially provided to the FDA before all questions were resolved. Be cautious about drawing firm conclusions about what was included in, or absent from, a 510(k) on the basis of a 510(k) summary.
Known issues
# | Submission Number | Issue | Date Checked | Status |
---|---|---|---|---|
1 | K050151 | Empty summary | 2023-08-14 | Not resolved |
2 | K222386 | Wrong submission | 2023-08-14 | Not resolved |
3 | K221515 | Wrong submission | 2023-08-14 | Not resolved |
4 | K211740 | Wrong submission | 2023-08-14 | Not resolved |
5 | K202565 | Wrong submission | 2023-08-14 | Not resolved |
6 | K190916 | Wrong submission | 2023-08-14 | Not resolved |
7 | K190027 | Wrong submission | 2023-08-14 | Not resolved |
8 | K170825 | Wrong submission | 2023-08-14 | Not resolved |
9 | K162044 | Wrong submission | 2023-08-14 | Not resolved |
10 | K900070 | Not a 510(k) summary (Complete submission) | 2023-08-14 | Not resolved |
11 | K030515 | Corrupt PDF | 2023-08-14 | Not resolved |
12 | K160695 | Corrupt PDF | 2023-08-14 | Not resolved |
13 | K173946 | Corrupt PDF | 2023-08-14 | Not resolved |
14 | K181029 | Corrupt PDF | 2023-08-14 | Not resolved |
15 | K192198 | Corrupt PDF | 2023-08-14 | Not resolved |
16 | K202408 | Corrupt PDF | 2023-08-14 | Not resolved |
17 | K210112 | Corrupt PDF | 2023-08-14 | Not resolved |
18 | K221619 | Corrupt PDF | 2023-08-14 | Not resolved |
19 | K210801 | Not a 510(k) summary (Decision summary) | 2023-08-14 | Not resolved |
20 | K993307 | Missing pages | 2023-08-14 | Not resolved |
21 | K220672 | Empty summary | 2023-08-14 | Not resolved |
Thanks to Jake W. for identifying many of these.
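If these records would distort an analysis, one option is to drop them before searching. The sketch below uses base R. The known_issues vector copies just three rows of the table above for brevity, and the small pages data frame is a made-up stand-in for pmn_summaries.

```r
# A few submission numbers from the known issues table (not the full list)
known_issues <- c("K050151", "K222386", "K030515")

# Made-up stand-in for pmn_summaries with one row per page
pages <- data.frame(
  submission_number = c("K050151", "K000001", "K030515", "K000001"),
  page_number = c(1L, 1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# Keep only the pages whose submission is not on the known issues list
pages_clean <- pages[!(pages$submission_number %in% known_issues), ]
```

Because the issue statuses may change as the dataset is updated, it is probably worth rebuilding the exclusion list from the current table rather than hard-coding it.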
Changelog
- 2024-02-29:
  - Added additional 510(k) summaries.
- 2023-12-23:
  - Added additional 510(k) summaries.
- 2023-10-25:
  - Added additional 510(k) summaries.
- 2023-08-14:
  - Fixed an error in the sample script in R that identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. After submission numbers are extracted, they are now all made upper case using stringr::str_to_upper() before they are counted. Before, for example, “K100001” and “k100001” would have been counted as different submissions because of the difference in case for the “K”. This fix did not change the results presented in Table 1.
  - Fixed a spelling mistake in a footnote.
  - Added Known issues section.
  - Clarified that De Novo reclassification orders are treated as 510(k) summaries for the purposes of this dataset.
  - Added new submissions recently posted to the FDA website.
  - Added old submissions that were previously missing from the dataset. I believe the dataset is now comprehensive as of 2023-08-14.
- 2023-07-17: Initial publication.
Footnotes
1. Upon initial publication, this dataset included more than 72,000 510(k) summary packages. This includes De Novos, where the reclassification order is used as the summary.
2. Upon initial publication, this dataset included more than 494,000 pages.
3. See the FDA’s description of the necessary Content of a 510(k), which describes this in more depth.
4. The requirement for a 510(k) summary or a 510(k) statement is from the Safe Medical Devices Act (SMDA) of 1990. The regulation, 21 CFR 807.92, was established through an interim rule with 57 FR 18066 on April 28, 1992 and was finalized with 59 FR 64295 on December 14, 1994.
Reuse
Citation
@online{o'leary2024,
author = {O’Leary, Brendan},
title = {Data for Researchers: {Extracted} Text from More Than 72,000
{FDA} Medical Device 510(k) Summaries},
date = {2024-02-29},
url = {https://www.boleary.com/blog/posts/202307-pmn/},
langid = {en}
}