PHEWAS
today we're talking about phenome-wide association studies, or PheWAS. First we're going to start with the motivation for multi-phenotype associations, and then we're going to look at different ways of dealing with multiple phenotypes: on one hand, GWAS and epigenomics across many phenotypes; then phenome-wide association studies, or PheWAS; and then meta-phenotype inference and imputation from the clinic.

So first of all, why multi-phenotype analysis? What are we trying to do? Association testing has typically been between genetic variants and a single trait. Every single example we've looked at so far was basically a single variable and how it varies across hundreds of thousands of individuals. But as opposed to that one single variable, there's an increasing trend toward truly much richer big data: instead of measuring a single variable across individuals, you can measure hundreds of variables, or perhaps the entire clinical record, for those individuals. That's what PheWAS is all about. It's the realization that there's a lot more information out there, so we can put it to good use. What can you do with it? You can discover new associations with multiple traits, recognize similarities and differences between phenotypes, and infer networks of phenotypic and genetic variables jointly.

You can use this for many different types of applications. You can associate unknown or underappreciated phenotypes with known ones: you can say, hey, the activity of the frontal cortex when stimulated with a learning task is in fact correlated with schizophrenia. That particular trait by itself might be underappreciated, but once you realize that it's a very good predictor of schizophrenia, you can use the two together. You can identify disease mechanisms based on the intermediate phenotypes that mediate them, from the genetic level to the epigenomic level, the transcription level, the imaging level, as I mentioned earlier for the brain example, and all kinds of other endophenotypes that can give you a hint as to how all of this ultimately leads to the more complex phenotype. And we're going to see cases where, by relating different types of phenotypes to each other, we can infer causality between them. You can predict disease with easy-to-measure biomarker phenotypes: for example, the level of lipids in your blood might be highly indicative of, I don't know, type 1 diabetes or your obesity predisposition, and so forth. These biomarkers can carry a lot more information than genetic data, which usually has very weak effects, whereas the biomarkers may have much stronger correlations; they may be much more predictive even if they're not causal, and they can still be helpful as biomarkers.

You can also use combinations of these phenotypes to impute missing data for many, many more individuals. In your electronic health record you may have the variable of interest for only a small number of people, but you may have highly correlated variables, and you can use those highly correlated variables to impute the missing data across different phenotypic variables, and then improve your GWAS power by including the imputed phenotypes across the entire cohort. Even if you've only measured that particular blood lipid level in, I don't know, a thousand individuals, you may be able to impute it across 100,000 individuals and therefore boost your power greatly. You may also be able to improve the biological relevance of genome-wide association studies by carrying out an association not just with each phenotype individually, but with meta-phenotypes that might be more indicative of the true underlying biology, because they combine multiple phenotypic variables under a single umbrella meta-phenotype. Everybody with me so far?
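To make the imputation idea concrete, here is a minimal sketch on simulated data: fit a regression on the small subset where the target biomarker was measured, then predict it for everyone else. All variable names, effect sizes, and cohort sizes below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated cohort: 100,000 individuals with 3 routinely measured
# biomarkers that happen to be correlated with a rarely measured lipid
# level (all effect sizes here are made up for illustration).
n = 100_000
routine = rng.normal(size=(n, 3))
true_lipid = routine @ np.array([0.6, -0.3, 0.2]) + rng.normal(scale=0.5, size=n)

# The lipid was only assayed in ~1,000 individuals.
measured = np.zeros(n, dtype=bool)
measured[rng.choice(n, size=1_000, replace=False)] = True

# Fit on the measured subset, impute for everyone else.
model = LinearRegression().fit(routine[measured], true_lipid[measured])
imputed = model.predict(routine[~measured])

# Correlation between imputed values and the (held-out) true values.
r = np.corrcoef(imputed, true_lipid[~measured])[0, 1]
print(f"imputation accuracy r = {r:.2f}")
```

In a real cohort the "true" values for the unmeasured individuals are of course unknown; the point is only that a phenotype well predicted by routinely collected variables can be imputed across the whole cohort and then used in association testing.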
Awesome. You can also use it to understand pleiotropic effects. "Pleiotropic" comes from Greek: pleio basically means many — like the Pleiades, the cluster with many, many stars together — and tropos means way or turn. So pleiotropic effects are effects where a particular genetic variant, or a particular gene, can act in multiple ways. This is the term usually used when the same genetic locus, sometimes the same genetic variant, or sometimes the same gene through different genetic effects, affects multiple phenotypes. For example, the same gene might be involved in controlling your metabolism inside your adipose cells and also your food selection in your brain — seemingly unrelated pathways, but they can converge later on; or you can simply have the same locus having many different, otherwise unrelated, effects.

You can also guide both experiments and diagnosis by predicting missing phenotypes. You can say, hey, I predict your lipid levels will be unusual, so why don't you go ahead and take that test — and so on and so forth. You can say, hey, this patient seems to have type 2 diabetes; let's measure their blood glucose level and see whether we can confirm that prediction, which can be made based on any other phenotype. And you can also start moving towards personalized medicine by focusing on combinations of symptoms rather than individual phenotypic variables. Everybody with me so far? So that's a strong motivation for studying multiple phenotypes together.
If you look at electronic health records, they try to achieve exactly that. Electronic health records contain very rich information: for example, the drugs a person is taking; the lab tests, such as blood lipids, the person may have taken; imaging information from, say, a brain scan or an X-ray; and also free-text information from the doctor taking notes. All of that has been put into various databases over the last few years, with increasing digitization of these data sets, which can be used at the point of care, used to infer statistics across different types of variables, and also used to carry out research. This information includes clinical notes, which are unstructured free-form text. It includes lab tests, which are observations with identifier names. And it includes billing codes: the ICD-9 and ICD-10 codes of the International Classification of Diseases are names you hear very often, which standardize the nomenclature so that it is interoperable across different systems and different hospitals. Another classification is DRG, the diagnosis-related groups, and there's also the Current Procedural Terminology, or CPT. All of these are different billing codes; they were initially used to help insurance companies pay for different things, but more and more they're recognized to be very informative about the patients' phenotypes — though of course they have a lot of limitations, which can be partly overcome by treating the data together. And these records can also include pharmaceutical prescriptions; RxNorm, for example, is one such database of prescription drugs. Everybody with me so far?

If you look at the last few years: about a decade ago there was only 10% adoption of electronic health records, and we're now at 90%. This is a dramatic transformation of the healthcare system in a very, very short time. Places where there were simply no clinical records now have them at nearly 100 percent frequency, which is again a remarkable transformation. Moreover, in addition to being more prevalent, they're increasingly comprehensive: EHR data can include many different types of variables, and more and more they include richer information. Initially, comprehensive data existed for a much smaller proportion of the individuals who even had EHRs; now roughly forty percent of those with EHRs have comprehensive ones, and that trend has continued, with more and more hospitals holding rich information. So the electronic health record data
itself can be thought of, in a Bayesian framework, as a reflection of the true state of a patient. We never observe the true state; we only measure various things which can tell us about the true underlying state. The true patient state is hidden from us, but through some recording process we obtain raw electronic health record data that is a reflection of it. What's really interesting is that the true state also determines the type of health care the person will undergo: whether a person will be treated or not depends — we'd like to think — on the true state of that patient, if the doctors are able to make an accurate diagnosis. And that in turn informs the types of phenotypes that are gathered for those individuals. If we believe somebody has type 2 diabetes based on another indication, we will go ahead and measure their blood glucose level, for example, as a way to validate our prediction, or to test whether they have it or not; but that might not be something we test routinely in individuals who are not at risk. So the true state can influence both the values of a measurement and whether a measurement is made at all. Everybody with me here? That basically means the missing data is not missing at random: it is missing through a very particular, non-random process, and the missing data is more likely to be normal than the observed data, which is more likely to be abnormal. If I only measure glucose levels for individuals I expect to be at risk, the distribution I'm going to get for the observed data is going to be very skewed towards abnormal levels. Everybody with me on that? Great. Using all of that, we can then tackle all kinds of knowledge tasks: stratification of patients into groups, prediction of mortality or of whether somebody will get sick, understanding of the underlying biology, and guiding interventions for those patients.
Everybody with me so far? So that's a lot of motivation for why we want to look at multiple phenotypic variables at once. Now let's look at different ways of doing so. The first is to use genetic information to model these pleiotropic effects. Here's something we've touched upon before, in terms of the hugely complex architecture of some of these psychiatric disorders: across about 3,000 cases and controls, you can compute a polygenic score associated with psychiatric disorders, and that score turns out to be associated with additional disorders beyond schizophrenia. If you build the score for one trait, you can then turn around and ask whether it is predictive of another trait. What you find is that, for example, genetic variants associated with schizophrenia are also associated with bipolar disorder, and much more rarely with non-psychiatric disorders, suggesting a common architecture and a common genetic basis for both. In fact, that's what led to the push for increasing sample sizes when nothing was coming out of the genome-wide association studies: we were basically saying, listen, we know there's genetic signal there, let's just push harder and we will find it. Doing this resulted in many, many different places in the genome being associated with disease — and some places then started to be multiply associated with disease. If you look at a mapping of genome-wide association study hits across the genome, you start seeing some loci that are associated with many, many traits, and that was the first motivation for PheWAS, rather than just GWAS. So
what is PheWAS doing? GWAS asks, for a particular trait, what its association is with every genetic variant. What PheWAS does instead is this: rather than putting different genetic variants on the x-axis with one phenotype per plot, we ask, for a particular genetic variant, what are all of the traits it is associated with? It's flipping GWAS around: it asks for the phenome-wide association between a bunch of different phenotypes and one SNP, instead of the association between a bunch of SNPs and one phenotype. Raise your hands if you're with me so far. Awesome, great. So that's what PheWAS is doing — flipping things around. You can take a particular SNP sitting in some locus and ask what its associations are, and you see that suddenly a lot of different traits light up: neoplastic traits such as carcinomas and neoplasms, non-melanoma skin cancer and so forth, as well as dermatological traits such as keratosis and solar keratosis. That is basically saying: hey, wait a minute, through this common genetic variant I can associate skin disorders with melanoma disorders. Biologically this makes a lot of sense, but it's nice that it comes out of the data directly, and it provides a systematic approach to look for trait-trait relationships through the genetic variants that they share. Here's an example of leukoplakia together with keratosis. Here's another example where you can see many more cardiovascular traits, as well as hematopoietic and metabolic traits, that are very closely related to each other: myocardial infarction and coronary atherosclerosis, as well as type 2 diabetes, for example, are associated with the same locus. Suddenly you are linking together phenotypic traits that would otherwise be separate and studied in isolation. And here's another example from the HLA locus — this is where the major histocompatibility complex sits, and where a lot of the variants affect the immune system functioning of different individuals — and you can see that it has a very large number of associations: with type 1 diabetes, with rheumatoid arthritis, with all kinds of other traits.
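The flip just described — fixing one SNP and scanning many phenotypes — can be sketched in a few lines. This is a toy simulation, not real data: the phenotype names and effect sizes are invented, and a simple trend test stands in for the covariate-adjusted logistic regressions typically used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# One SNP (genotypes coded 0/1/2) for 5,000 individuals, plus a matrix of
# binary EHR-derived phenotypes.  Names and effects are invented.
n, n_pheno = 5_000, 6
genotype = rng.choice([0, 1, 2], size=n, p=[0.49, 0.42, 0.09])
pheno_names = ["skin_cancer", "keratosis", "asthma",
               "type2_diabetes", "hypertension", "migraine"]

# Make the first two phenotypes genuinely depend on the SNP (pleiotropy).
logits = np.full((n, n_pheno), -2.0)
logits[:, 0] += 0.5 * genotype
logits[:, 1] += 0.4 * genotype
phenotypes = rng.random((n, n_pheno)) < 1.0 / (1.0 + np.exp(-logits))

# PheWAS: one SNP, one association test per phenotype (simple trend test).
pvals = []
for y in phenotypes.T:
    _, p = stats.pearsonr(genotype, y.astype(float))
    pvals.append(p)

for name, p in zip(pheno_names, pvals):
    flag = "*" if p < 0.05 / n_pheno else " "  # Bonferroni over phenotypes
    print(f"{flag} {name:15s} p = {p:.2e}")
```

In practice each phenotype is tested with a regression adjusting for age, sex, and ancestry, and the significance threshold accounts for the hundreds or thousands of phenotype codes scanned per SNP.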
Any questions so far? You can also do the same across billing codes — across these electronic health record billing codes. You can look at all of these codes and ask how different traits are associated with each other, and you see, for example, that multiple sclerosis is co-associated with all kinds of additional traits. So the way to think about PheWAS is killing many birds with one stone: you have a single stone, which is the SNP, and you can look at all of the different birds — the phenotypes — that are associated with it.

And you can use this from the very discovery phase. If I have multiple phenotypes to start with, then instead of thinking of a single phenotypic variable — a single phenotype vector across individuals — let me think of a matrix: N individuals by D traits at the same time. Then you can correlate that with the genetic information for those individuals, a genotype matrix X, and with the effect sizes of each of these variants on each of these traits. So this becomes a matrix operation, transforming a matrix into a matrix, rather than working with individual vectors for each trait. At the same time you have a random-effect matrix across individuals and traits, and an environmental component capturing the non-genetic variation. So you can model the multiple traits using these multivariate linear mixed models instead of univariate linear mixed models. When Matthew Stephens and his group did that, in Nature Methods in 2014, they showed that, depending on the number of phenotypes affected and the number of traits associated with them, you can get a huge boost in power by doing multi-trait modeling compared to single-trait modeling: if you model four traits at a time versus only two traits at a time, you see a dramatic increase in power, and an increase in the variance explained per SNP, using this multi-phenotype model. Everybody with me so far?
There are many different ways of doing this. You can study the relationship between different traits by asking whether they are co-associated with the same SNP, or whether you can boost the power of individual GWAS discovery by using them jointly. Or you can use LD score regression, which we talked about in our heritability lectures, to calculate the shared genetic component of multiple traits. In this case you study pairs of traits: you take the number of individuals in each study, and the LD score of each SNP, which sums up the correlation patterns across the SNPs in its block; using that, you can relate the LD score of each SNP to the association statistics across pairs of traits, and use this LD score correlation to estimate the relationship between the traits. For example, for "ever smoked", you can ask what its co-association is, at the genetic level, with BMI, with childhood obesity, with fasting glucose, or with type 2 diabetes, using LD score regression, and you can see how each of these traits shows significant co-associations, in blue. You can see groups of traits that are very highly correlated: for example height, the head circumference of an infant, the length at birth, and the weight at birth are in fact all very highly correlated with each other. This comes from asking what the shared genetic component of these traits is across the whole genome: even if any one locus might not be genome-wide significant, you can still study the correlation. You find Crohn's disease and ulcerative colitis, which are both immune traits of the intestine; HDL cholesterol and the age of menarche, when periods start; BMI and childhood obesity; fasting glucose and type 2 diabetes; depression, bipolar disorder, and schizophrenia — all showing this huge shared signal. Everybody with me so far?
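The cross-trait version of LD score regression rests on a simple linear relationship: the expected product of two studies' z-scores at a SNP grows with that SNP's LD score, with a slope proportional to the genetic covariance. A minimal simulated sketch follows — all numbers are invented, and the intercept terms that arise from sample overlap are ignored.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of cross-trait LD score regression:
#   E[z1_j * z2_j] ~= sqrt(N1*N2) * rho_g * l_j / M + intercept,
# so the genetic covariance rho_g can be read off the regression slope.
# These are simulated values, not real GWAS summary statistics.
M = 50_000                      # number of SNPs
N1 = N2 = 20_000                # sample sizes of the two GWAS
rho_g = 0.4                     # true genetic covariance to recover
ld = rng.gamma(shape=4.0, scale=25.0, size=M)   # fake LD scores

mean_prod = np.sqrt(N1 * N2) * rho_g * ld / M
z1z2 = mean_prod + rng.normal(scale=1.0, size=M)  # noisy per-SNP products

# Regress z1*z2 on the LD score; rescaling the slope estimates rho_g.
slope, intercept = np.polyfit(ld, z1z2, 1)
rho_hat = slope * M / np.sqrt(N1 * N2)
print(f"estimated genetic covariance: {rho_hat:.2f} (true {rho_g})")
```

The appeal of this estimator is that it needs only summary statistics from the two studies plus reference LD scores, not individual-level genotypes.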
You can also do this at the region level: go region by region, compute a Bayes factor — how much more likely is this region to be associated with the disease than by chance, under different models — and then calculate the similarity between traits based on models of shared effects. Again this leads to very similar results, with groups of related traits; in this particular case a lot of immunological traits, and anthropometric traits, clustering together.

What Joe Pickrell realized is that there's a very cool thing you can do with this. You can say: if I take all of the variants that are truly associated with BMI, do they have an effect on triglycerides? And I can do the converse, asking whether all the triglyceride-associated variants have an effect on BMI. What he found is a very interesting asymmetry between pairs of traits. In one direction — ascertaining specifically on BMI — there was a correlation with the effect sizes on triglycerides, whereas in the opposite direction, looking only at triglyceride-associated variants, there was no correlation between their triglyceride effect sizes and their BMI effect sizes. That led him to postulate that this is evidence of a direction of causality: if BMI variants truly have an effect on triglycerides, then there will be a strong correlation of the effect sizes when I focus on BMI variants, but no correlation when I focus on triglyceride variants. Similarly, for LDL cholesterol and coronary artery disease, ascertaining on one preserves the correlation for the other, while ascertaining the other way loses it. So this allows us to start doing causal inference from the directionality of these conditional associations. Everybody with me so far?

Here are some additional examples. BMI has an effect on type 2 diabetes, but not the other way around: body mass index is a huge risk factor for type 2 diabetes, but type 2 diabetes does not lead to obesity. Conversely, abnormal thyroid activity is associated inversely with height, but not the other way around. You can also use this to start inferring pleiotropic variants, by looking at every combination of settings between pairs of SNPs and their alleles and asking what the corresponding effect size is on each of the different phenotypes, by enumeration.
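The logic of that asymmetry test can be illustrated with a toy simulation. Everything below is invented — the effect-size scales, the SNP counts, and the built-in assumption that BMI causally raises triglycerides — so it demonstrates the reasoning rather than reproduces the published analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# If BMI is causal for triglycerides (TG), every BMI variant's TG effect
# is proportional to its BMI effect; but TG also has its own variants
# with no systematic BMI effect.
n_bmi_snps, n_tg_snps = 200, 200
b_bmi = rng.normal(scale=0.05, size=n_bmi_snps)      # BMI effect sizes
causal = 0.5                                         # assumed BMI -> TG

# BMI-ascertained SNPs: TG effect inherited through BMI (+ noise).
tg_of_bmi_snps = causal * b_bmi + rng.normal(scale=0.01, size=n_bmi_snps)

# TG-ascertained SNPs: direct TG effects, no systematic BMI effect.
b_tg = rng.normal(scale=0.05, size=n_tg_snps)
bmi_of_tg_snps = rng.normal(scale=0.01, size=n_tg_snps)

r_fwd, _ = stats.pearsonr(b_bmi, tg_of_bmi_snps)     # strong correlation
r_rev, _ = stats.pearsonr(b_tg, bmi_of_tg_snps)      # ~no correlation
print(f"BMI-ascertained r = {r_fwd:.2f};  TG-ascertained r = {r_rev:.2f}")
```

The asymmetry between the two correlations is exactly the signature used to argue for a BMI-to-triglycerides direction of effect rather than the reverse.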
The other approach is to start asking about partitioning the genome into its phenotypic associations. The concept here is that we can now start doing genetics at the systems level — looking at the whole genome each time, and at many, many different phenotypic associations. What systems genetics tries to do is characterize multiple polygenic associations for complex disorders: define what the normal range is, and define the correlations between genetic variants and traits. A pioneering paper proposed this more than seven years ago — basically, to start understanding comorbidity patterns and association patterns of different traits. More recently, several people have been asking what has come of that field — where is systems genetics today — and there's still a lot of debate about that. We can certainly characterize systems-level genetic outcomes, not just one gene at a time, and we can integrate data across many different modalities, but we're still missing a lot of the inference of networks and correlations. With larger and larger biobanks that have genotyped individuals, each profiled for thousands of traits, we can actually start going a step beyond that and truly deliver on the promise of genetics. In the UK Biobank, for example, we can look at hundreds of different traits; in this particular case we're selecting 47 highly robust ones to analyze. And the
last part that I want to talk about today is the concept of inferring these meta-phenotypes and then imputing individual phenotypic variables using those methods. In this particular case we're not going to focus on genetics; we're only going to use the phenotypes coming straight from the clinic. The reason is that there are truly millions of individuals in different hospital systems — the Mayo Clinic, Partners HealthCare, the Geisinger system, Kaiser Permanente; all of these contain millions of individuals — but only a small, small fraction of them have been genotyped. We're looking at maybe 10,000, 50,000, 100,000 for some of the largest cohorts, but still nowhere near the millions. So the genotypes are often not available for the large patient cohorts; but given the causal mediating phenotypes, the diseases of interest are conditionally independent of genotype. Electronic health records therefore allow deep phenotyping, so we can start inferring the true underlying biological roots. The idea is that if you had such an intelligent system, you could train the model on existing data and then use it to make clinical recommendations: it could recommend lab tests, it could recommend specific diagnoses to the doctor that could then be verified by the professionals, or it could recommend prescriptions and even treatment procedures, if you could do that systematically. So this is what
we've been trying to do. In this particular case you can see the comorbidity patterns of different phenotypes — this is a plot generated by Jose Davila — and for each of those modules of comorbidity patterns you can see the biological terms associated with the corresponding ICD-9 codes, and you can see, in a color-coded fashion, how these cluster with each other in terms of their comorbidity patterns. The intuition, then, is that we're going to factorize that matrix: across individuals and across phenotypes, we're going to look for a lower-dimensional representation of these associations, which allows us to infer latent factors. These can correspond both to patient-to-patient similarities — clusters of patients that share phenotypic variables — and to clusterings of phenotypes together when they are associated with similar sets of patients. So the idea is that we can move along any one of the dimensions of this multi-dimensional object: on one hand the similarity of different EHR features and the underlying latent disease topics — these are the meta-phenotypes, which we can learn from combinations of many records — and on the other hand the patient dimension. By looking at the correlation between different EHR features across patients we can infer latent disease topics, and similarly, by looking at the correlation between patients across EHR features, we can infer patient groups that are more robust than the information from any one feature.
A very common approach in this methodology is to group words by their topic — bag-of-words approaches. You say: all these words are associated with the arts, all these words with the budget, all these words with children or with education; and by simply counting how frequently each of these words appears in a particular document, you can classify documents according to their topic. Everybody with me here? What we're going to do is a very similar approach, where we now learn healthcare topics associated with different words, and then infer the disease topics of each person based on the set of words that their clinical record contains.
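As a concrete sketch, patients can play the role of documents and diagnosis codes the role of words, and a standard topic model then recovers latent "disease topics". Everything below is simulated; the code names and the topic structure are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(4)

# Patients = documents, billing/diagnosis codes = words,
# latent topics = meta-phenotypes.  All names are made up.
codes = ["T2D", "obesity", "HbA1c_high", "MI", "statin", "chest_pain",
         "asthma", "inhaler", "wheeze"]
topics = np.array([               # per-topic code probabilities
    [.3, .3, .3, .02, .02, .02, .01, .01, .02],   # metabolic topic
    [.02, .02, .02, .3, .3, .3, .01, .02, .01],   # cardiac topic
    [.02, .01, .01, .02, .02, .02, .3, .3, .3],   # respiratory topic
])
n_patients = 500
mix = rng.dirichlet([0.3, 0.3, 0.3], size=n_patients)  # topic mix/patient

counts = []
for i in range(n_patients):
    p = mix[i] @ topics                        # patient's code distribution
    counts.append(rng.multinomial(40, p / p.sum()))  # 40 codes per record
counts = np.array(counts)

# Fit a topic model and show the top codes per recovered topic.
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
for k, comp in enumerate(lda.components_):
    top = [codes[j] for j in comp.argsort()[::-1][:3]]
    print(f"topic {k}: {top}")
```

On real EHR data the "vocabulary" spans tens of thousands of codes, labs, and note-derived terms, but the factorization idea is the same.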
But there are challenges. The first challenge is that electronic health record data are extremely noisy, and they're also extremely biased. Unlike the corpus documents of ordinary text mining, clinical notes are full of typos and full of arbitrary abbreviations: hypertension and coronary atherosclerosis abbreviated in idiosyncratic ways, "CHF", "observe", "suspect", and so on — extremely arbitrary shorthand throughout the notes, things like "hospit" for hospital. In text mining it's usual to remove stop words; here there are no clearly defined stop words. And the billing codes are not meant to be disease-specific — they're just used for billing. So how do you deal with all of that?

The second challenge is that the EHR data are extremely sparse. If you ask what fraction of patients has which phenotypes, you see that the vast majority of phenotypes are each found in less than 5% of the patients: we have about 30,000 phenotypes that are each associated with less than 5% of the patients, and then a few phenotypes that are associated with a huge number of patients — and the same holds for ICD-9 codes, labs, and so forth. That means we somehow have to deal with this extreme sparsity, where most of the data is simply never ascertained, and most of these variables are simply empty in most individuals. Everybody with me for challenge number two? Then challenge number three is that the missing data is
extremely biased. If you look at whether a particular variable was measured — for example, whether a lab result was taken — you see the following. These are 192 different lab tests, one per column, and for each we ask what fraction of people showed a normal score versus what fraction were in the top 1% or the bottom 1% of the population distribution. You can see that about 20% of the tested people are in the top 1%, and another 20% are in the bottom 1%. How is that possible? It's possible because we are pretty good at predicting whether something will be normal or not, and therefore, when a test is prescribed, most of the time we expect to see an abnormal result. So when 20% of the patients show an abnormally high result and another 20% an abnormally low one, that means I cannot use the observed data to estimate the true underlying distribution of these variables — I would be greatly amplifying the extremes of the distribution. Everybody with me here?
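A tiny simulation makes the bias concrete (the testing probabilities and lab values below are invented): if a lab is ordered mostly when an abnormal result is already suspected, the observed values over-represent the tails.

```python
import numpy as np

rng = np.random.default_rng(5)

# Population lab values (simulated): mean 100, sd 15.
true_glucose = rng.normal(100, 15, size=50_000)

# The lab is actually ordered mostly when an abnormal result is
# suspected, i.e. testing probability rises with abnormality.
abnormality = np.abs(true_glucose - 100) / 15
p_tested = 0.05 + 0.9 * (abnormality > 2)
tested = rng.random(50_000) < p_tested

frac_extreme_pop = np.mean(abnormality > 2)           # ~4-5% in truth
frac_extreme_obs = np.mean(abnormality[tested] > 2)   # far higher observed
print(f"extreme in population: {frac_extreme_pop:.1%}, "
      f"among observed labs: {frac_extreme_obs:.1%}")
```

Estimating the population distribution from the tested subset alone would therefore vastly overstate how common abnormal values are — the same pattern as the 20%/20% tails in the real lab data above.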
So what do we need to do? What we need are ways of specifically dealing with the not-missing-at-random problem. Not missing at random is something that has been well recognized — look, for example, at users rating music. If you assigned music to people at random and asked them to score it, you would find that very few people give the best score: most of the time they're not going to like what they listen to, and some of the time they're going to like it a lot. But if you look at the music people actually listen to, they self-select songs they expect to like, and so they end up giving a much higher proportion of 5-star, 4-star, and 3-star ratings than you would expect if the songs were assigned at random. Everybody with me so far? That basically means that if a rating is missing, the user would probably not have liked that song — that's why they never bothered playing it. Does that make sense? In the same way with clinical data: if something is missing, it's very likely to be normal, whereas if something is observed, it's much more likely to be abnormal — the observed data is very frequently abnormal, just as the songs I actually play are the ones I'm likely to rate highly. So
there has been a lot of work on explicitly modeling the pattern of missingness, on studying the driver mechanism underlying the missingness pattern. The concept is that you can distinguish different missingness mechanisms. There is missing completely at random, where missingness is driven neither by the observed data nor by the missing data. There is missing at random, which basically says that the missingness indicator depends only on the observed results, but not on the missing results themselves. And then there is not missing at random, where we have to explicitly model both the missing and the non-missing data: the value of the missing data itself influences whether the data is missing or not. As we saw before, the tests that are missing are more likely to be normal, and the tests that are observed are more likely to be abnormal. Everybody with me so far?
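To make the three mechanisms concrete, here is a small simulation sketch; all distributions and coefficients are invented for illustration, and only the not-missing-at-random mechanism biases what the observed values look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True lab values: standard normal; "abnormal" means a large |value|.
value = rng.normal(size=n)
# An always-observed covariate (say, age) correlated with the lab value.
age = 0.5 * value + rng.normal(size=n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# MCAR: missingness independent of everything.
# MAR:  missingness depends only on the observed covariate.
# MNAR: abnormal values are more likely to be *observed*,
#       so normal values are more likely to be missing.
missing = {
    "MCAR": rng.random(n) < 0.5,
    "MAR":  rng.random(n) < sigmoid(-2.0 * age),
    "MNAR": rng.random(n) < sigmoid(-3.0 * (np.abs(value) - 1.0)),
}

obs_mean = {}
for name, miss in missing.items():
    obs_mean[name] = np.abs(value[~miss]).mean()
    print(f"{name}: mean |observed value| = {obs_mean[name]:.2f} "
          f"(true population mean = {np.abs(value).mean():.2f})")
```

Under MCAR the observed mean matches the population; under MNAR the observed labs look far more abnormal than the population, which is exactly why naively fitting the observed distribution amplifies the extremes.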
So if you look at the model, you can explicitly create a variable associated with missingness. You can basically say: here is an indicator variable that tells me whether the data is missing or not, and there is a partially observed result. Knowing which cluster the patient belongs to then informs both the true underlying value and the probability that I would even observe that variable. That's the whole concept: by first grouping the patients into clusters, we can learn which clusters are associated with abnormal scores, and then do a cluster-conditional estimation of the values for specific lab tests. If a lab test is observed, it lets me cluster the patients based on all of the phenotypic information for those patients, but also on the lab test itself; so I can start clustering the patients, and then within each cluster say that the probability with which a variable is missing is in fact highly dependent on cluster membership. Raise your hands if you're with me. Awesome. So then the key idea is to encode the missingness indicator as part of your model: you could include the missingness indicator as part of your data itself, or you could have it as a latent variable within the model. And here you can see, for overall prediction of medical prescriptions for example, the effect of how many variables you include missingness information from: after five variables or so there are continued improvements in overall prediction, measured by the area under the curve, which suggests that modeling the missingness pattern is indeed extremely important. Okay, so then I can
jointly model both the lab test assignment, meaning whether the patient was prescribed a particular lab test at all, regardless of its value, and the actual result of that lab test. For example, if I belong to a set of patients with respiratory problems, I'm much more likely to be asked to have a complete blood cell count, and that complete blood cell count is much more likely to be abnormal; the same for C-reactive protein. If I have kidney problems, I'm similarly much more likely to be prescribed a urea nitrogen test than if I don't. So you can start modeling this by having the patient cluster dictate both the lab result and the lab presence itself, and having the lab presence depend on the lab result. Of course I don't observe the missing result directly, but knowing the probability that the patient belongs to a particular cluster, I can flow information through. That's basically an example of explicitly modeling the missingness mechanism: was this lab observed or not? Sounds good? So then we can do this across many different phenotypic variables. Using both lab presence and lab results, I can learn hyperparameters that dictate the probability that a particular lab variable was observed or not, and the value of that variable. We can then jointly model the missingness indicators and the lab test results, whereby the observation for a particular patient depends on the score with which that patient belongs to a particular topic, on the likelihood of that lab test's result, and on the missingness-indicator likelihood for that particular lab. That basically says that the number of patients assigned to a particular topic when lab L is observed, versus when L is not observed, is in fact highly dependent on that class.
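As a toy illustration of this idea, here is how knowing both which labs were ordered and their results updates the posterior over a patient's cluster; the cluster names, lab choices, and all probabilities below are hypothetical, not taken from the actual model:

```python
import numpy as np

# Hypothetical two-cluster, two-lab sketch: cluster 0 = "respiratory",
# cluster 1 = "renal"; labs = [complete blood count, urea nitrogen].
pi = np.array([0.5, 0.5])            # prior over patient clusters
p_ordered = np.array([[0.9, 0.2],    # P(lab ordered | cluster)
                      [0.3, 0.9]])
p_abnormal = np.array([[0.8, 0.1],   # P(result abnormal | ordered, cluster)
                       [0.2, 0.7]])

def cluster_posterior(ordered, abnormal):
    """P(cluster | which labs were ordered, plus results where ordered).

    ordered  : 0/1 per lab -- the missingness indicator, itself informative
    abnormal : 0/1 per lab, only meaningful where ordered == 1
    """
    log_post = np.log(pi).copy()
    for k in range(len(pi)):
        for l in range(len(ordered)):
            if ordered[l]:
                log_post[k] += np.log(p_ordered[k, l])
                log_post[k] += np.log(p_abnormal[k, l] if abnormal[l]
                                      else 1.0 - p_abnormal[k, l])
            else:
                # Absence of the lab order is evidence about the cluster too.
                log_post[k] += np.log(1.0 - p_ordered[k, l])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# A patient with an abnormal CBC and no urea nitrogen test ordered:
post = cluster_posterior(ordered=[1, 0], abnormal=[1, 0])
print(post)  # mass concentrates on the "respiratory" cluster
```

Note that the un-ordered lab still contributes a factor, which is exactly the "missingness is informative" point.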
Then you can do that both for lab tests and for all of the other phenotypes, looking at various binary clinical features, at all kinds of ICD-9 codes, and so on. You can basically model, number one, the patient's meta-phenotype, the class or disease topic that the patient belongs to, and, conditioned on that class, both the frequency of the different phenotypes and whether the patient is given a particular test and what the result of that test will be. So I can use the relatively simpler framework for all of the phenotypes, and the relatively more complex framework for the labs, because of the prescription bias for the labs themselves.
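A crude way to get at the idea of one shared set of patient topics learned jointly across modalities is non-negative matrix factorization with a per-modality block of features; this is only a rough stand-in sketch on random placeholder counts, not the actual probabilistic model used here:

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder count data: 100 patients x three modalities.
modalities = {
    "icd9":  rng.poisson(1.0, size=(100, 30)),
    "labs":  rng.poisson(0.5, size=(100, 20)),
    "notes": rng.poisson(2.0, size=(100, 50)),
}
K = 5  # number of meta-phenotype topics

# One shared patient-by-topic matrix W; concatenating modalities means the
# per-modality feature matrices are just column blocks of H.
X = np.hstack(list(modalities.values())).astype(float)
W = rng.random((X.shape[0], K)) + 0.1
H = rng.random((K, X.shape[1])) + 0.1

# Multiplicative updates (Lee & Seung): every modality informs the same W.
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

# Row-normalize W: each patient becomes a mixture over meta-phenotype topics.
topic_mix = W / W.sum(axis=1, keepdims=True)
print(topic_mix.shape)  # (100, 5); each row sums to 1
```

The design point is that the patient loading is shared, so evidence from notes, labs, and billing codes all pulls on the same topic assignment.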
So then what Yue Li did in the group is develop a model called MixEHR, a mixture model for electronic health records, that allows you to do exactly what was in the previous slide: model this meta-phenotype and treat the missingness indicator for each of the lab variables, as well as all of the additional phenotypes. What he does is decompose the matrix of patients by lab tests, by ICD-9 billing codes, by questionnaires, prescriptions, treatments, and all of the clinical notes; for each of these data modalities he has a particular factorization, and then a loading matrix that allows him to combine the different modalities together. He can then predict, for each patient, the risk of belonging to each class of meta-phenotypes, and infer the disease class for each person. So cutting across DRG codes, ICD-9 codes, lab tests, prescriptions, and clinical notes, this hierarchical model allows you, at the top level, to infer the meta-phenotype, and then, within each of the data modalities, a topic-specific component, again partitioning the phenotypic matrix into each component. What you can then start asking is,
in simulation, whether the model in fact captures the structure of simulated data, and what we can see is that the not-missing-at-random component greatly outperforms the missing-at-random model: if we assume that the missing data is missing at random, we get much worse performance, whereas if we assume that the data is not missing at random, we get better performance, and the inference agrees with the true model much more frequently.
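The flavor of that comparison can be reproduced in a few lines: simulate informative lab ordering, then estimate the population abnormality rate while either ignoring or modeling the observation mechanism. All rates below are invented for illustration, and the "aware" analysis simply reuses the true observation probabilities rather than learning them:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Invented ground truth: only 10% of the population has an abnormal lab.
abnormal = rng.random(n) < 0.10
# Informative observation: abnormal labs are ordered far more often.
p_obs = np.where(abnormal, 0.8, 0.1)
observed = rng.random(n) < p_obs

# Missing-at-random analysis: treat observed labs as a random sample.
naive = abnormal[observed].mean()

# Missingness-aware analysis: reweight each observed lab by
# 1 / P(observed | value), i.e. inverse-probability weighting.
w = 1.0 / p_obs[observed]
aware = (abnormal[observed] * w).sum() / w.sum()

print(f"true 10.0% | MAR-style estimate {naive:.1%} | "
      f"missingness-aware estimate {aware:.1%}")
```

The MAR-style estimate badly overstates how abnormal the population is, while the estimate that models the observation mechanism recovers the truth.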
Here are some terms, for example, that are very heavily associated with different meta-phenotypes. So here you're learning the different meta-phenotypes, and then you can ask which individual disease phenotypes are associated with each of them. What you can see is that for every one of these meta-phenotypes there is a large number of different data modalities contributing, shown here in different colors, which suggests that I can combine information from clinical notes, from labs, from prescriptions, from billing codes, from all of these different data modalities. You can also ask how these topics correlate with quantitative variables such as the age of the individual: which topics are the most positively correlated with age, and which are the most negatively correlated with age. What we can see is that, for example, heart failure, cardiovascular disease, and dementia are three of the most positively correlated with age, and these are diseases of old age. If you look at the most negative correlations with age, these are neonatal conditions, preterm birth, and so forth, which go with the earliest age group, and what's really interesting is that preterm birth is in fact even more negatively associated, which is again fully consistent with the biology. You can also ask, for different classes of traits, for example schizophrenia-associated traits or PTSD-associated traits, what the top-scoring terms are. Here you can see, for example from the clinical notes, "delusion" in schizophrenia and bipolar, along with various psychiatry-related note terms featuring in the notes of the doctors dealing with patients in this cluster; the ICD-9 billing codes seem to agree with that, and with schizophrenia you also see, for example, poisoning by tranquilizers, toxic effects of caustics not otherwise specified, affective disorder, and so on and so forth. So you can see how, across different data modalities, we're combining terms that are all strongly indicative of schizophrenia or of post-traumatic stress disorder.
Everybody with me so far? What you can also do is start prioritizing patients based on the disease mixture. Here, for example, you can see that only a subset of the individuals in this particular cluster were tagged with leukemia, or with pulmonary embolism, or with liver cirrhosis; the grey squares are where that particular phenotype was not explicitly declared. And if you look here, you can see why these patients were in fact selected to be part of that group: even though they did not have leukemia explicitly mentioned, they had all these other phenotypes that are highly indicative of leukemia, and even though these patients didn't have pulmonary embolism explicitly mentioned, they have all these other phenotypes that are highly predictive of it. You can then explicitly see the terms, color-coded according to the evidence type, that are associated with each of these clusters. For topic 31, for example, you can see the most informative terms in this word cloud, and similarly for topic 35 and topic 15: many different lines of evidence are all combined together to infer the membership of these patients, and therefore to complete what would otherwise be missing information.
So that allows us to now start imputing missing EHR codes. We can go from something extremely sparse to actually imputing the values for each of those entries, and you can see here that many of the white gaps are getting filled in, suggesting that maybe the doctors should in fact have added that additional term. You can also test this by hiding some of the variables: I know that the patient has a given code, but I remove it, and ask whether I am able to recover it. The answer is overwhelmingly yes: basically 82 percent of the time, if you remove a term, we're able to predict it back. And in fact this is also biased against us, because usually when you add one particular term, that may be the only place it is marked, and removing that one term then decreases the patient's membership in that particular class. The other thing you can do is in fact predict future mortality: based on the
visit today, will the patient die before the next time they come to the hospital? That can be very helpful, and the answer is again overwhelmingly yes: you get about 85% AUC for predicting whether a patient will die before their next hospital visit. So if we can do this reliably, we probably want to let people know, to warn them in advance: don't wait until your next visit, maybe you should be treated now. You can also ask which terms are the most predictive of mortality. You can see that if you have ventricular fibrillation, for example, or cardiac arrest at the last visit, or anoxic brain damage, and so forth, you're in pretty bad shape for making it through to your next visit; the same for renal failure or acute necrosis of the liver; and being on a mechanical ventilator is basically a good indication that next time around you won't make it, and so on and so forth. So basically, what
this is telling us is that we can make not just a classification of the current state of the patient, but also a prediction of the future state of the patient. So, to summarize this part: phenome-wide association studies can elucidate phenotypic networks by linking genetic variants to their causally associated traits; you can model multiple traits jointly to reveal the genetic correlations between them; and you can start modeling EHR data, recognizing that it is extremely rich but that there are many modeling challenges because of the high dimensionality, the high sparsity, and the not-missing-at-random structure. Machine learning methods, and especially generative models, are particularly helpful here: they hold great promise to learn compressed latent dimensions of these super-high-dimensional data sets and also to impute the missing data to reveal the underlying structure. So, what we covered today: we basically motivated why we even care about these
multiple phenotypes studied jointly, and I introduced the concept of PheWAS: associating every phenotype with every other phenotype, rather than a genetic variant with every phenotype, in the context of specific genetic variants. We expanded that to start looking at LD blocks and their co-association patterns across an entire clinical record, such as from the UK Biobank. We went a step further, from region-by-region co-association to a factorization of the genotype-phenotype correlation matrix, to understand the different factors of these disease associations. We looked at the epigenomic enrichment analysis we had seen before, but now in the context of multiple genomic annotations and multiple phenotypic variables simultaneously, learning the co-enrichment patterns of specific phenotypic classes into similar epigenomic annotations, and using that to improve the enrichment and improve the pathway inferences. And lastly, we saw how we can model electronic health records by specifically looking at meta-phenotypes, these higher-order terms that combine information from multiple individual data modalities, such as clinical notes, lab tests, prescriptions, and ICD-9 billing codes, and how we can use that to impute missing data and to prioritize patients by disease. I hope you feel that you've learned something. Awesome.