Getting Started with Syntegra’s Synthetic Data API

The goal of Syntegra’s Synthetic Data API is to make accessing patient-level healthcare data a lot easier for data scientists, analytics engineers and product developers.

Access thousands, or even millions, of patient records directly in your preferred compute environment to build, test and analyze more easily and quickly than ever before.

Whether you’re a “jack of all trades” data professional at an early-stage health tech startup, a data scientist at a larger company, or someone building your next (or first) AI/ML model, Syntegra’s Synthetic Data API gives you access to the patient-level synthetic healthcare data that you need.

Answering healthcare questions with Syntegra’s Synthetic Data API

Syntegra’s Synthetic Data API provides an easy-to-use tool for answering data science and analytics questions quickly using synthetic data. In the walkthrough below, Tuva Health co-founder Coco Zuloaga uses the API within a Python Jupyter Notebook environment to access synthetic healthcare data and compare the characteristics of a full patient population with those of a specific patient cohort. You can also watch the full example video here.

Accessing the data

After authenticating using your API key, you can run a command to view the synthetic datasets that you have access to based on your subscription, including both EHR and claims datasets in several formats.

import pandas as pd
import requests

r = requests.get(f'{api_url}/datasets', headers=auth_info)
pd.DataFrame(r.json()['Contents'])
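The calls in this walkthrough all follow the same pattern: send a GET request, then read the 'Contents' key of the JSON response. A minimal sketch of a helper for that pattern might look like the following. The function name `fetch_contents` and the 30-second timeout are assumptions made for illustration and are not part of Syntegra's API; `api_url` and `auth_info` are whatever you set up when authenticating.

```python
def fetch_contents(http, api_url, path, auth_info, timeout=30):
    """GET {api_url}/{path} and return the 'Contents' field of the JSON body.

    `http` is anything with a requests-style .get(), for example the
    `requests` module itself or a requests.Session. Hypothetical helper.
    """
    r = http.get(f"{api_url}/{path}", headers=auth_info, timeout=timeout)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
    return r.json()["Contents"]


# Usage in the notebook (assumes api_url and auth_info are already defined):
#   import requests
#   import pandas as pd
#   pd.DataFrame(fetch_contents(requests, api_url, "datasets", auth_info))
```

Checking the status code before reading the body makes an expired key or a mistyped path show up as a clear HTTP error rather than a `KeyError` on 'Contents'.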

You can then ping the API to explore and understand the tables and fields underlying each dataset. In this example, we’ll look at the claims dataset in the pre-processed Tuva Data Model format. View the primary tables within the dataset — such as “condition,” “coverage,” “encounter,” etc. Dig in even further to see what columns are available within these primary tables, like “encounter ID,” “charge amount,” “facility NPI,” “encounter type,” etc.

After deciding that this is the dataset you want to work with, you can request access to your desired dataset via the API, and it will be available on S3. Once there, you can easily save the entire contents of the dataset, as well as the specific tables within it, into a pandas DataFrame.

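# Request the dataset; the response lists its files and where they live on S3.
r = requests.get(f'{api_url}/dataset/CLAIMS_TUVA_SAMPLE/data', headers=auth_info)
print(r.json())

# One column per file entry, one row per field (including "s3Url").
df = pd.DataFrame.from_records(r.json()['Contents']).T

# Load the entry at position 2, which this walkthrough uses as the "encounter" table.
df = pd.read_csv(df.loc["s3Url", 2], compression='gzip')
df.head()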
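The transpose followed by the `df.loc["s3Url", 2]` lookup can be hard to picture, so here is a small sketch using a mock payload. Only the "s3Url" key comes from the example above; the "table" key, the file names and the URLs are invented for illustration, and the real response may contain different fields.

```python
import pandas as pd

# Mock of r.json()['Contents']: only the "s3Url" key is taken from the example
# above; the "table" key and all values are made up for illustration.
contents = [
    {"table": "condition", "s3Url": "https://example.invalid/condition.csv.gz"},
    {"table": "coverage", "s3Url": "https://example.invalid/coverage.csv.gz"},
    {"table": "encounter", "s3Url": "https://example.invalid/encounter.csv.gz"},
]

# from_records gives one row per file; .T flips it so each file is a column
# (0, 1, 2, ...) and each key ("table", "s3Url") is a row label.
manifest = pd.DataFrame.from_records(contents).T

# Row "s3Url", column 2 is therefore the download link of the third file.
url = manifest.loc["s3Url", 2]
print(url)
```

Because the lookup is positional, it is worth printing the full listing once to confirm which position holds the table you want before passing the link to `pd.read_csv`.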

Setting up the analysis

Now that you have access, you can ask a specific question of the dataset. For example: What does inpatient utilization look like for the entire patient population?

To answer this question, first use the “encounter” table to check two things:

  1. How many unique patients are in the dataset?
  2. How many acute inpatient encounters occurred?

# 1. Count distinct patient IDs.
number_of_patients = len(df['PATIENT_ID'].unique())
print(number_of_patients)

# 2. Count rows whose encounter type is acute inpatient.
acute_inpatient_encounters = df[df['ENCOUNTER_TYPE'] == 'acute inpatient'].shape[0]
print(acute_inpatient_encounters)

Then, calculate the acute inpatient rate within the full population.

full_population_AIP_rate = acute_inpatient_encounters / number_of_patients
print(full_population_AIP_rate)
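Since the same calculation is repeated later for a cohort, it can help to wrap it in a function. The sketch below does that on a made-up toy table; the function name and the toy rows are assumptions, while the `PATIENT_ID` and `ENCOUNTER_TYPE` column names come from the example above. It uses `nunique()`, which, unlike `len(...unique())`, does not count missing patient IDs.

```python
import pandas as pd


def acute_inpatient_rate(encounters: pd.DataFrame) -> float:
    """Acute inpatient encounters per unique patient."""
    n_patients = encounters["PATIENT_ID"].nunique()
    n_acute = (encounters["ENCOUNTER_TYPE"] == "acute inpatient").sum()
    return n_acute / n_patients


# Toy encounter table: 4 patients, 2 acute inpatient encounters (made-up data).
toy = pd.DataFrame({
    "PATIENT_ID": ["a", "a", "b", "c", "d"],
    "ENCOUNTER_TYPE": ["acute inpatient", "office visit", "acute inpatient",
                       "office visit", "emergency department"],
})
rate = acute_inpatient_rate(toy)
print(rate)  # 0.5
```

Note that this rate is encounters per patient, not the share of patients with at least one stay: a patient with two acute inpatient encounters contributes two to the numerator.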

Defining patient cohorts with concepts

Next, compare the acute inpatient rate of the full population against the same rate for a specific patient cohort.

Syntegra already includes a number of predefined concepts and cohorts that you can explore and work with. Or you can create your own concepts and cohorts relevant to your area of interest. Some of Syntegra’s prebuilt cohorts are related to conditions such as diabetes, liver disease, pregnancy and myocardial infarction, to name a few. Concepts and cohorts that you build are stored within your API workspace (tied to your API key) and can thus be reused. Ping the API to see what concepts are already available.

r = requests.get(f'{api_url}/concepts/', headers=auth_info)
df = pd.DataFrame(r.json()['Contents'])
df

In this example, let’s look at a cohort of diabetes patients. Define the cohort using the API and request data only for this cohort, which then becomes available via S3 and is saved into a pandas DataFrame.

import json

cohort = {"schema": "CLAIMS_TUVA_SCHEMA", "name": "diabetes",
          "definition": ["clm_diabetes"], "private": "True"}
r = requests.post(f'{api_url}/cohorts/', headers=auth_info,
                  data=json.dumps(cohort))
print(r.json())

# 39 is the ID of the cohort in this example; substitute the ID of your own cohort.
r = requests.get(f'{api_url}/dataset/CLAIMS_TUVA_SAMPLE/data?cohort=39',
                 headers=auth_info)
print(r.json())
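To avoid hard-coding the cohort ID into the URL, the request path can be assembled from its parts. This is a minimal sketch using only the standard library; `cohort_data_url` is a hypothetical helper, and `https://example.invalid` stands in for your real `api_url`. How the cohort ID is reported in the POST response is not shown in this walkthrough, so the sketch takes the ID as a plain argument.

```python
import json
from urllib.parse import urlencode


def cohort_data_url(api_url: str, dataset: str, cohort_id: int) -> str:
    """Build the data URL for one cohort of a dataset (hypothetical helper)."""
    return f"{api_url}/dataset/{dataset}/data?{urlencode({'cohort': cohort_id})}"


# Same cohort definition as above, serialized the way requests.post(data=...) sends it.
cohort = {"schema": "CLAIMS_TUVA_SCHEMA", "name": "diabetes",
          "definition": ["clm_diabetes"], "private": "True"}
body = json.dumps(cohort)

url = cohort_data_url("https://example.invalid", "CLAIMS_TUVA_SAMPLE", 39)
print(url)
```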

Using the “encounter” table of the diabetes cohort dataset, calculate the acute inpatient rate for diabetes patients with the same steps as for the full patient population.

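# Build the file listing from the cohort response, as in the earlier step,
# then load the entry at the same position as before.
df = pd.DataFrame.from_records(r.json()['Contents']).T
df = pd.read_csv(df.loc["s3Url", 2], compression='gzip')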

number_of_patients = len(df['PATIENT_ID'].unique())
print(number_of_patients)

acute_inpatient_encounters = df[df['ENCOUNTER_TYPE'] == 'acute inpatient'].shape[0]
print(acute_inpatient_encounters)

diabetes_population_AIP_rate = acute_inpatient_encounters / number_of_patients
print(diabetes_population_AIP_rate)

As this example shows, in less than 10 minutes of using the API you can make a quick assessment of a dataset and compare it with a specific cohort of patients. In this case, after calculating the acute inpatient rates for both the full patient population of the claims dataset and the cohort of diabetes patients, we found that the rate for the diabetes population is almost three times as high as the rate for the entire population. This is just one of many exploratory analyses that you can conduct using Syntegra’s Synthetic Data API.
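The comparison itself is a single division of the two rates. The sketch below shows it with placeholder values; the numbers are illustrative only and are not the results from this example, and `rate_ratio` is a hypothetical helper name.

```python
def rate_ratio(cohort_rate: float, population_rate: float) -> float:
    """How many times higher the cohort rate is than the population rate."""
    if population_rate == 0:
        raise ValueError("population rate is zero; ratio is undefined")
    return cohort_rate / population_rate


# Illustrative numbers only, not the results from the example above.
full_population_AIP_rate = 0.125
diabetes_population_AIP_rate = 0.25
ratio = rate_ratio(diabetes_population_AIP_rate, full_population_AIP_rate)
print(f"{ratio:.1f}x")  # 2.0x
```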

More of a visual learner? Watch the full video.

Questions? Not a subscriber? Try out the API for free.

Check out our documentation.

Want to explore different ways of using the API? Reach out!