
What is Protected Health Information (PHI)?

What is it and how do we secure it?

~20mins estimated

General

PHI: the basics

What is PHI?

Protected Health Information (PHI) is individually identifiable health information that is regulated under the U.S. Health Insurance Portability and Accountability Act of 1996. It is essentially personally identifiable information (see our PII lesson for more info) involving medical/health details.

PHI may, for example, include (according to U.S. HHS):

  • Information doctors, nurses, and other health care providers put in a person’s medical record
  • Conversations a doctor has about a person’s care or treatment with nurses and others
  • Information about a person in their health insurer’s computer system
  • Billing information about a person at their clinic
  • Most other health information about a person held by those who must follow HIPAA laws

According to U.S. HHS, to protect PHI, covered entities (described in more detail below) must:

  • Put in place safeguards to protect health information and ensure they do not use or disclose health information improperly.
  • Reasonably limit uses and disclosures to the minimum necessary to accomplish their intended purpose.
  • Have procedures in place to limit who can view and access health information as well as implement training programs for employees about how to protect health information. Business associates also must put in place safeguards to protect health information and ensure they do not use or disclose health information improperly.

About this lesson

In this lesson, you’ll understand the concept of Protected Health Information (PHI), its evolution to electronic forms, its relevancy in software development, potential risks posed by negligent storage, and best practices for handling and protecting it.

FUN FACT

PHI in research

As long as it is properly de-identified, meaning the data cannot be traced back to any individual, researchers can use or disclose information from an individual’s medical records. Although other laws may apply, there are no restrictions under HIPAA on the use or disclosure of such de-identified health information. The HIPAA Privacy Rule balances protecting personal privacy with permitting important uses of the information, like medical research.

Who must follow these guidelines?

As mentioned above, the entities that are required to adhere to HIPAA are called ‘covered entities.’ According to U.S. HHS, this includes Health Plans (private health insurance, company health plans, government health plans, etc.), most Health Care Providers (those that conduct certain business electronically, for example, billing your health insurance), and Health Care Clearinghouses (entities that process non-standard health information and convert it to standard formats, e.g. digitizing medical content). Business associates of covered entities must also follow parts of HIPAA; this includes bodies like billing agencies, separate companies that process insurance claims, outside lawyers, etc. However, not every entity you might expect is required to comply with HIPAA; life insurers and employers, for example, are generally not covered entities.

According to U.S. HHS guidance, HIPAA allows PHI to be shared for a person’s treatment and care, for doctors’ and hospitals’ business operations, with relatives and loved ones the person approves of, with nursing home staff, with public health officials, and with police (e.g., to report gunshot wounds). However, PHI cannot be used or shared without the person’s written permission unless it is explicitly allowed by law. For example, providers cannot typically share such information for marketing purposes. In addition, individuals have certain rights under HIPAA, including the right to access their own data and to see where it is being used or disclosed.

HIPAA violations may result in significant civil and criminal penalties, including substantial fines and potential imprisonment. The specific penalty amounts and terms vary based on the nature and severity of the violation. For current penalty information and enforcement guidelines, consult the official HHS Office for Civil Rights website and other authoritative government sources.

Importance of PHI

The Institute of Medicine pushed for Electronic Health Records (EHRs) back in the 1990s, and they have only grown in prominence since then, posing ethical and security dilemmas surrounding data ownership, liability, informed consent, and privacy.

Today, most health records are contained in relational or hierarchical databases whose reach extends far beyond a single paper copy stored in a sectioned folder. Their development is typically handled by vendors, and they are built to be client-server-based with the capability to share records between relevant facilities. An entire data standards system had to be introduced to protect this information – while maintaining its interoperability and accessibility.

Some people avoid seeking treatment because of fear that their information is not secure! Mishandling of PHI can lead to severe consequences for both patients and the businesses with access to their information. Developers play a vital role in securing patient trust. Consistent responsibility with PHI is part of the foundation of customer loyalty; conversely, even a single breach can damage a company’s public image and discourage people who need medical treatment from seeking help.

See the charts below from HIPAA Journal to visualize the increases in different types of breaches.

HIPAA Journal types of breaches chart.

Data breaches have become more common since 2009 (the year the Office for Civil Rights started publicizing records), and 2024 housed the largest healthcare data breach ever reported: 190 million people were affected by a ransomware attack (malicious software that denies access to user data using encryption until a ransom is paid) on Change Healthcare/UnitedHealth Group. During development, it is vital to be mindful of access control, encryption, data integrity, transmission security, and more in order to avoid detrimental breaches like this one.

HIPAA Journal chart of individuals affected by data breaches over time.

PHI in action

Developers that handle PHI may be subject to multiple laws and regulations, including HIPAA and, in some cases, requirements under the 2020 Cures Act Final Rule, which outlines requirements involving the use of APIs, interoperability, security, documentation, and authentication. The aim of this particular legislation was to make healthcare apps and APIs more interoperable and standardized, which also uplifts security standards across the industry. Storage of PHI is complicated and sensitive information can be contained in many front and back-end processes – and is often transported between different locations. Not only do health records chronicle an individual’s medical history, which is sensitive in and of itself, but there is often also billing info, addresses, emails, and other Personally Identifiable Information.

Moreover, sophisticated back-end logging mechanisms are required for EHRs under the HIPAA Security Rule, and they are a vital component of tracing where breaches come from, whose data is impacted, and when they occur. However, logs themselves store lots of PII, which can be troublesome and requires encrypting the logged data to avoid breaches originating from the logs.

Let's look at an example under the hood of storing PHI without encryption (via a python function).
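A minimal sketch of what such a function might look like (the table name, columns, and connection handling here are illustrative assumptions, not taken from a real codebase):

```python
import sqlite3

def add_patient(conn: sqlite3.Connection, name: str, ssn: str, diagnosis: str) -> None:
    # PHI is written to the database as plain text, and the values are
    # concatenated directly into the SQL string: no encryption, no
    # parameterization. Anyone with database access can read the data,
    # and the query is wide open to SQL injection.
    query = (
        "INSERT INTO patient (name, ssn, diagnosis) "
        f"VALUES ('{name}', '{ssn}', '{diagnosis}')"
    )
    conn.execute(query)
    conn.commit()
```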

Here, a developer created a simple function to insert patient data into the patient table of a SQL database. However, because the information is written into the database as plain text and the query is built from raw input, an attacker who performs a SQL injection (see our SQL injection lesson), for example through a login form, could access any user data they want with no safeguards.

Neglecting data encryption throughout the data storage process is extremely problematic. While encryption is a well-known requirement, developers sometimes only encrypt data in transit (often using Transport Layer Security/Secure Sockets Layer) but fail to encrypt it at rest (within physical hardware components, like on a server or hard drive, and within software like back-end logging). Both are essential to protect PHI from unauthorized users.

Some other common mistakes include misjudging how much data to collect, underestimating the complexity of HIPAA, and relying on simple login techniques.

Balancing how much data is appropriate to collect can be a gray area. EHRs are starting to include family health history, information on social and genetic factors of disease, and entire genomic sequences. Biosurveillance is being used to detect public health events. Yet data collection involving health can be paradoxical. On one hand, collecting genomic sequences, familial history, etc. can lead to copious insight and innovation regarding public health, but having so much data on any one individual maximizes the risk associated with a data breach.

Developers need to strike a balance between what is helpful and what’s appropriate and in line with applicable law. We need to avoid invasion of privacy while maintaining bioinformatic power. Data validation, protection of the collected data, and mindfulness about the information that is appropriate to use and transfer can all combat the issues arising from the massive volume of data.

Underestimating the complexity of compliance with HIPAA and other federally enforced guidelines is another common error among developers. It can be difficult to sort through all of the requirements, and interpretation can be subjective and dependent on the specific circumstances. The section below outlines best practices that can help teams comply more fully with HIPAA’s requirements.

Finally, many EHR apps and APIs require only a username and password, and fail to implement more robust systems like multi-factor authentication or token management protocols to avoid password storage altogether. Because of the nature of the information stored in EHRs, there should be serious protection against unauthorized access and these are good ways to combat the problem.

Generally, developers should adopt a privacy-first approach along all steps of the design process. Incorporating security and privacy considerations into every phase of the development lifecycle to ensure that PHI is always protected can help prevent breaches like the one discussed above, where 190 million people were harmfully exposed. The next section will discuss best practices in more detail and provide a guide for developers to more readily comply with codified guidelines.


PHI mitigation

Encryption and transmission security

One of the most influential and accessible safeguards for PHI is encryption, both for information in transit and at rest. Encryption allows PHI to be locked up rather than being stored as plain text that anyone who can find it is able to read. When at rest, data is ordinarily stored in databases, files, and backups, and in transit is found across networks, APIs, and between clients and servers. Encryption has two main forms – symmetric, which uses a single secret key to encode the information, and asymmetric, which has a pair of keys, one that is private and one that’s public. While asymmetric encryption can be more secure, symmetric tends to be more common, and developers should refer to AES (Advanced Encryption Standard) for data at rest and TLS (Transport Layer Security) for data in transit.
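As a rough sketch, field-level encryption at rest might look like the following, using the Fernet recipe from the widely used `cryptography` package (an AES-based symmetric scheme); key handling is deliberately simplified here for illustration:

```python
from cryptography.fernet import Fernet

# In a real system the key would come from a key management service,
# never be generated inline or stored next to the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

def encrypt_phi(plaintext: str) -> bytes:
    """Encrypt a PHI field before it is written to storage."""
    return fernet.encrypt(plaintext.encode("utf-8"))

def decrypt_phi(token: bytes) -> str:
    """Decrypt a stored PHI field for an authorized caller."""
    return fernet.decrypt(token).decode("utf-8")

stored = encrypt_phi("Diagnosis: Type 2 diabetes")
print(decrypt_phi(stored))  # only callers holding the key can recover the value
```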

Some other considerations for developers are encrypting sensitive data within logs, cache systems, and anywhere else PHI might be temporarily stored, and managing encryption keys so that they are securely stored and rotated regularly. Only the people, systems, and apps that genuinely need to access the PHI should have the keys, and key separation and rotation keep the information more reliably secure. Keys should never be stored alongside the encrypted data.

Access and audit control

Generally, access and audit controls are hardware, software, and/or procedural mechanisms that keep track of and analyze activity in information systems.

Firstly, authentication confirms that a user is who they are claiming to be, and authorization separates access based on the kind of user, both of which are components of access control – which defines who can retrieve information. The principle of least privilege should guide decision making, where users and systems should have only the minimum level of access necessary for their function/need. Failing to implement granular, role-based access control (RBAC) can be detrimental. For example, a hospital receptionist should not have the same access to patient data as their physician.
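A simplified illustration of role-based access control in Python; the roles, permissions, and function names here are hypothetical:

```python
from functools import wraps

# Hypothetical role-to-permission mapping following least privilege:
# a receptionist can see scheduling details, a physician can also see clinical notes.
ROLE_PERMISSIONS = {
    "receptionist": {"read_schedule"},
    "physician": {"read_schedule", "read_clinical_record"},
}

def requires_permission(permission: str):
    def decorator(func):
        @wraps(func)
        def wrapper(user_role: str, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(user_role, set()):
                raise PermissionError(f"role '{user_role}' may not {permission}")
            return func(user_role, *args, **kwargs)
        return wrapper
    return decorator

@requires_permission("read_clinical_record")
def view_clinical_record(user_role: str, patient_id: int) -> str:
    return f"clinical record for patient {patient_id}"

print(view_clinical_record("physician", 42))   # allowed
# view_clinical_record("receptionist", 42)     # raises PermissionError
```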

Multi-factor authentication (MFA) for accessing sensitive areas of an application or database can also notably uplift access control mechanisms. MFA limits the chances of data leaks and exposures, and adds an extra layer of security, requiring users to provide multiple forms of identification (ex. a password and a code from a separate device) to access PHI. Even if a password is compromised, authentication remains safeguarded.
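For example, a time-based one-time password (TOTP) check as a second factor could be sketched with the `pyotp` library; the secret is generated inline here purely for illustration and would normally be provisioned per user and stored securely:

```python
import pyotp

# Each user gets their own secret, typically enrolled once via a QR code
# scanned into an authenticator app.
secret = pyotp.random_base32()
totp = pyotp.TOTP(secret)

def second_factor_ok(submitted_code: str) -> bool:
    """Return True only if the code matches the current TOTP window."""
    return totp.verify(submitted_code)

# After the password check succeeds, the app still requires the TOTP code:
print(second_factor_ok(totp.now()))  # True for a freshly generated code
print(second_factor_ok("000000"))    # almost certainly False
```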

Additionally, an audit trail is a record of events and activities within a software system. Apart from promoting HIPAA compliance, implementing comprehensive, unchangeable system logs that track every action taken on PHI allows entities to internally chronicle who accesses information and, in case of a security breach, to analyze the log to identify who it affected, when it occurred, and why it happened.
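One way to make audit entries tamper-evident, sketched below with illustrative field names, is to chain each record to the previous one with a hash so that editing any earlier entry breaks the chain:

```python
import hashlib
import json
from datetime import datetime, timezone

audit_log: list[dict] = []

def record_access(user_id: str, patient_id: str, action: str) -> None:
    """Append a tamper-evident entry recording who touched which record and when."""
    prev_hash = audit_log[-1]["hash"] if audit_log else "0" * 64
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "patient_id": patient_id,
        "action": action,
        "prev_hash": prev_hash,
    }
    # The hash covers the entry plus the previous hash, so any later edit
    # to an earlier record is detectable when the chain is verified.
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode("utf-8")
    ).hexdigest()
    audit_log.append(entry)

record_access("dr_lee", "patient-42", "viewed_clinical_record")
record_access("billing_svc", "patient-42", "exported_invoice")
```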

Masking and de-identification

Masking and de-identification of PHI are important steps in protecting privacy in non-production environments such as testing, development, and research. The difference between the two: masking obscures information irreversibly while keeping the data in its original format (unlike encryption). For example, a Social Security number might be displayed as “_ _ _ - _ _ - 1234” rather than listing all nine digits. De-identification, by contrast, removes personally identifiable characteristics from the data altogether, making it impossible to trace back to an individual.
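A minimal masking helper, assuming a nine-digit SSN stored as a string:

```python
def mask_ssn(ssn: str) -> str:
    """Show only the last four digits while preserving the familiar SSN format."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return f"***-**-{digits[-4:]}"

print(mask_ssn("123-45-6789"))  # ***-**-6789
```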

De-identification allows researchers to use medical records without them being individually identifiable, and in many cases, allows developers to analyze user behavior or trends and app functionality without compromising privacy. This can be achieved, according to HIPAA, by one of the two main methods shown below. However, the process is tricky and there are cases when combinations of different information can identify an individual even when the data is de-identified. Developers need to stay aware of this risk and regularly review techniques to ensure they are robust and effective, following HIPAA guidelines and any other applicable laws.

See this chart from the HIPAA website for some approved methods of de-identification.

HIPAA methods of de-identification.

Data integrity

Another main pillar of security involving PHI is data integrity, both in its maintenance and disposal. While being stored, developers must ensure that data is not altered or destroyed unintentionally, is validated after being inputted by clinicians, is retained properly, and is disposed of mindfully if necessary. This means designing systems with outlined validation, retention, and disposal policies. Validation is important along every step of the data entry system – input validation, cross-field validation, and database constraints all make sure the clinicians are inputting patient information that makes sense. Plus, these methods also help to prevent bad actors from accessing data through methods like SQL Injection.
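Contrasting with the plain-text example earlier, here is a sketch of validated, parameterized input (the validation patterns and table layout are illustrative assumptions):

```python
import re
import sqlite3

def add_patient_safely(conn: sqlite3.Connection, name: str, ssn: str, diagnosis: str) -> None:
    # Input validation: reject values that cannot be legitimate before they
    # ever reach the database (the patterns are simplified for illustration).
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", ssn):
        raise ValueError("SSN must match NNN-NN-NNNN")
    if not name.strip():
        raise ValueError("Name is required")

    # Parameterized query: values are bound rather than concatenated into the
    # SQL string, which closes the injection hole shown earlier.
    conn.execute(
        "INSERT INTO patient (name, ssn, diagnosis) VALUES (?, ?, ?)",
        (name.strip(), ssn, diagnosis),
    )
    conn.commit()
```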

Retention defines how long it is appropriate to keep PHI, especially when someone dies or requests a change to their EHR. When is it necessary to get rid of someone’s information? Developers must build in the capability to delete user data upon request, even though data in EHRs is rarely deleted completely because of its relevance to public health, research, and disease prevention. De-identification may be a happy medium much of the time, but when it comes to disposal, it’s not enough to simply delete files from the database or storage.

Secure deletion practices should be implemented to ensure that PHI cannot be recovered after it is removed. This could include methods like overwriting sensitive files with random data or using cryptographic erasure for encrypted data. Developers should also be mindful of where PHI might be stored, such as in backups or logs, and ensure that these are purged as part of a regular maintenance routine. Implementing strong data retention and disposal practices reduces the amount of data at risk and helps organizations stay compliant with privacy laws.
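Cryptographic erasure can be sketched simply: if each record is encrypted with its own key, destroying that key renders the remaining ciphertext, including copies sitting in backups, unreadable. The key store and record layout below are hypothetical:

```python
from cryptography.fernet import Fernet

# Per-patient data keys, held in a key store separate from the data itself.
patient_keys = {"patient-42": Fernet.generate_key()}

# The stored record is ciphertext; it is useless without the matching key.
record_ciphertext = Fernet(patient_keys["patient-42"]).encrypt(b"diagnosis: ...")

def cryptographically_erase(patient_id: str) -> None:
    """Destroy the patient's data key so the ciphertext can never be decrypted again."""
    patient_keys.pop(patient_id, None)

cryptographically_erase("patient-42")
# record_ciphertext (and any backup copies of it) is now unrecoverable.
```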

Secure natural language processing

Finally, NLP, an AI mechanism used to understand, interpret, and process human language, introduces a slew of security issues that need to be mitigated specifically in the context of PHI. Here, natural language processing is used to process text (like physician notes), extract the most important information, define relationships between entities, and structure data so that it can potentially be used in Clinical Decision Support Systems (CDSS, an emerging class of AI-assisted tools that help physicians make diagnoses and decisions). This innovation has the ability to make real strides in the medical sphere, but it also comes with potential risks. Some of the risks include:

  1. Training data vulnerabilities, such as data poisoning or sensitive data exposure in training.
  2. Inference attacks and data leakage, such as attackers exposing a patient’s record if it was used to train the model, inferring an individual’s medical information from the model’s outputs, and prompt injection.
  3. Implementation and integration risks, like insecure APIs, lack of granular access control, and third-party vendor risks (a breach in the vendor’s system can compromise patient data from multiple clients).
  4. Poor auditing, which makes tracking down a breach much more difficult or even impossible.

Using the methods above can help to mitigate NLP-introduced insecurities, and Snyk Learn has a whole learning path for OWASP’s Top 10 LLM and GenAI.

Conclusion

On the organizational level, having designated security personnel review development strategies is a good way to keep security a priority and maintain trust with customers and users. As Protected Health Information grows in volume and format, securing patient information becomes more and more vital: EHRs can now include images captured at the bedside, documentation templates, and mobile apps (letting patients play a more active role in their medical data), and they increasingly contain mental and behavioral health information (which is of particular concern privacy-wise). APIs and apps are a promising avenue for improving patient-centered care and clinical workflows, but they require careful development and strict compliance with HIPAA guidelines. The best practices outlined above collectively create a robust framework for safeguarding sensitive information in modern software systems, and allow developers to handle PHI responsibly, maintain user trust, and comply with regulations.

Disclaimer

This document provides general educational information only and must not be relied upon as legal, compliance or similar advice. For guidance on specific legal or compliance matters, consult qualified counsel.

Quiz

Which of the following is a reason why failing to encrypt data at rest is a pitfall in software development involving PHI?

Keep learning

Learn more about personally identifiable information and other best practices here:

Congratulations

You have taken your first step into learning more about PHI and how to keep it secure. We hope that you will apply this knowledge to make your applications safer, and make sure to check out our lessons on other common vulnerabilities.