De-identifying Medical Images with Amazon Comprehend Medical and Amazon Rekognition: A Guide by Chanci Turner

Medical imaging plays an essential role in contemporary healthcare, allowing clinicians to visualize crucial patient information for accurate diagnosis and treatment. The transition to digital medical images has significantly enhanced our capacity to store, share, view, search, and curate these images for the benefit of healthcare professionals. The variety of medical imaging modalities has expanded too, with CT scans, MRIs, digital pathology, and ultrasounds generating vast amounts of data stored within medical image archives.

These images are also invaluable for medical research. Machine learning enables scientists at leading global medical research institutions to analyze extensive datasets, encompassing hundreds of thousands or even millions of images, to gain deeper insights into various medical conditions. However, a significant challenge for healthcare providers lies in using these images while adhering to regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Often, medical images contain Protected Health Information (PHI) embedded as text within the images, complicating the de-identification process. Traditionally, removing PHI has required labor-intensive manual review and editing, making the de-identification of large datasets both time-consuming and costly.

In 2017, Amazon Web Services (AWS) added text detection to Amazon Rekognition, a machine learning service that can detect and extract text from images. The following year, AWS launched Amazon Comprehend Medical, a natural language processing (NLP) service specifically designed to help users identify and detect PHI in textual data. By combining these two services with a small amount of Python code, as demonstrated here, you can efficiently and affordably detect and redact PHI from medical images.

De-identification Architecture

For this example, we will utilize the Jupyter Notebooks feature of Amazon SageMaker to create an interactive environment with Python code. Amazon SageMaker is a comprehensive machine learning platform that streamlines the preparation of training data and the development of machine learning models using pre-built Jupyter notebooks equipped with established algorithms. In this guide, we will employ Amazon Rekognition for text extraction and Amazon Comprehend Medical for PHI detection. Our image files will be accessed from and stored in an Amazon Simple Storage Service (Amazon S3) bucket, which provides top-tier scalability, data availability, security, and performance.
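
The snippets that follow assume a small amount of common setup: the boto3 clients for Amazon Rekognition and Amazon Comprehend Medical, the S3 resource, and the imaging libraries used later in the notebook. A minimal sketch of that setup cell (the notebook's own version may differ slightly):

# Minimal setup assumed by the later snippets (a sketch; the
# notebook's own setup cell may differ slightly).
import io

import boto3
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

s3 = boto3.resource('s3')
rekognition = boto3.client('rekognition')
comprehendmedical = boto3.client('comprehendmedical')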

When utilizing Amazon Comprehend Medical to identify PHI, note that the service returns a confidence score for each detected entity, indicating how certain the service is that the entity was identified correctly. Consider these scores and review the identified entities to ensure they meet your specific requirements. For more information about confidence scores, refer to the Amazon Comprehend Medical documentation.
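
For example, keeping only entities whose score meets the notebook's phi_detection_threshold parameter might look like the following sketch, where sample_text is a hypothetical stand-in for text extracted from an image:

# Keep only PHI entities whose confidence score meets the threshold.
# sample_text is a hypothetical stand-in for text pulled from an image.
result = comprehendmedical.detect_phi(Text=sample_text)
confident_phi = [e for e in result['Entities']
                 if e['Score'] >= phi_detection_threshold]
for e in confident_phi:
    print(e['Type'], e['Text'], round(e['Score'], 3))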

Using the Notebook

You can download the supporting Jupyter Notebook from GitHub. This notebook includes an example chest x-ray image sourced from a dataset available through the NIH Clinical Center. For further details, see the NIH Clinical Center’s CVPR 2017 paper.

At the start of the notebook, you’ll find five adjustable parameters to control the de-identification process:

  1. bucket defines the Amazon S3 bucket where the images will be read from and written to.
  2. object specifies the image you wish to de-identify, which can be in PNG, JPG, or DICOM format. If the file ends in .dcm, it will be treated as a DICOM image, and the ImageMagick utility will convert it to PNG format prior to processing.
  3. redacted_box_color specifies the color used to obscure identified PHI text in the image.
  4. dpi determines the DPI setting for the resulting image.
  5. phi_detection_threshold is the confidence score threshold (ranging from 0.00 to 1.00). Text identified by Amazon Comprehend Medical must meet this minimum score to be redacted from the output image. The default value of 0.00 will redact all detected PHI, regardless of confidence level.
# Define the S3 bucket and object for the medical image we want to analyze.
bucket='yourbucket'
object='yourimage.dcm'
redacted_box_color='red'
dpi = 72
phi_detection_threshold = 0.00

Once these parameters are configured, you can run all cells within the Jupyter Notebook. The initial cell handles the conversion of the specified image from DICOM to PNG, if necessary, and subsequently reads the file from S3 into memory.

# If the image is in DICOM format, convert it to PNG
if object.endswith(".dcm"):
    # Download the DICOM file, convert it with ImageMagick's mogrify
    # (which writes a .png alongside the original), then upload the
    # converted image back to S3.
    filename = object.split("/")[-1]
    ! aws s3 cp s3://{bucket}/{object} .
    ! mogrify -format png {filename}
    ! aws s3 cp {filename[:-4]}.png s3://{bucket}/{object}.png
    object = object + '.png'
    print(object)
…
# Download the image from S3 and hold it in memory
img_bucket = s3.Bucket(bucket)
img_object = img_bucket.Object(object)
xray = io.BytesIO()
img_object.download_fileobj(xray)
img = np.array(Image.open(xray), dtype=np.uint8)

The image can then be sent to Amazon Rekognition for text detection via the DetectText feature. Amazon Rekognition returns a JSON object containing a list of detected text blocks and their corresponding bounding boxes, indicating their locations within the image.

response=rekognition.detect_text(Image={'Bytes':xray.getvalue()})
textDetections=response['TextDetections']
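
Each element of TextDetections carries the detected string, a Type of either LINE or WORD, a confidence score, and a Geometry whose bounding box is expressed as ratios of the image dimensions. A sketch of collecting the LINE-level blocks for later matching:

# Collect the LINE-level detections and their bounding boxes
# (coordinates are ratios of the overall image dimensions).
text_blocks = []
for detection in textDetections:
    if detection['Type'] == 'LINE':
        text_blocks.append({
            'text': detection['DetectedText'],
            'box': detection['Geometry']['BoundingBox']  # Left, Top, Width, Height
        })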

After identifying the text in the image, this text can be submitted to Amazon Comprehend Medical to ascertain which blocks may contain PHI using the DetectPHI feature. Amazon Comprehend Medical will return a JSON object that includes potential PHI entities, their types (name, date, address, ID), and confidence scores for each detection. This information assists in determining which bounding boxes may contain PHI.

philist=comprehendmedical.detect_phi(Text = textblock)
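
One way to bridge the two services is to check each Rekognition text block against the entities Amazon Comprehend Medical returned, collecting the bounding boxes of blocks that contain PHI at or above the threshold. A sketch of one possible matching strategy, using the text_blocks list built from the Rekognition response above:

# Collect bounding boxes for any text block containing a PHI entity
# that scores at or above the threshold. This is one possible matching
# strategy; the notebook's own logic may differ.
phi_boxes_list = []
for entity in philist['Entities']:
    if entity['Score'] >= phi_detection_threshold:
        for block in text_blocks:
            if entity['Text'] in block['text'] and block['box'] not in phi_boxes_list:
                phi_boxes_list.append(block['box'])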

Once the areas of the image that might contain PHI text are identified, redaction boxes can be applied over those areas.

for box in phi_boxes_list:
    # The bounding boxes are described as a ratio of the overall image
    # dimensions, so multiply by the image's width (shape[1]) and
    # height (shape[0]) to get pixel values. Note that NumPy image
    # arrays are indexed (rows, columns), i.e. (height, width).
    x = img.shape[1] * box['Left']
    y = img.shape[0] * box['Top']
    width = img.shape[1] * box['Width']
    height = img.shape[0] * box['Height']
    rect = patches.Rectangle((x, y), width, height, linewidth=0,
                             edgecolor=redacted_box_color, facecolor=redacted_box_color)
    ax.add_patch(rect)

The final de-identified image is saved back to the specified S3 bucket in PNG format, with “de-id-” prepended to the original filename.
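
Saving the result might look like the following sketch, which renders the matplotlib figure to an in-memory PNG at the configured dpi and uploads it with the "de-id-" prefix (the notebook's own save cell may differ):

# Render the redacted figure to an in-memory PNG and upload it to S3
# with "de-id-" prepended to the original filename (a sketch; the
# notebook's own save cell may differ).
output = io.BytesIO()
plt.savefig(output, format='png', dpi=dpi, bbox_inches='tight')
output.seek(0)
deid_object = 'de-id-' + object.split('/')[-1]
s3.Object(bucket, deid_object).upload_fileobj(output)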

Conclusion

This guide illustrates the effectiveness and flexibility of combining Amazon Comprehend Medical and Amazon Rekognition for the de-identification of medical images. By pairing Amazon Rekognition's text extraction with Amazon Comprehend Medical's PHI detection, image datasets that once required labor-intensive manual review and editing can be de-identified efficiently and affordably.

