PHEWAS

 


today we're talking about phenome-wide association so first we're gonna start with why multi-phenotype associations
what is the motivation and then we're going to look at different ways of dealing with multiple phenotypes
on one hand dealing with GWAS and epigenomics across many phenotypes
looking at phenomics or PheWAS and then looking at meta-phenotype inference
and imputation from the clinical record so first of all why multi-phenotype
analysis so basically what are we trying to do on one hand you know association testing has
been typically between you know genetic variants and a single trait every single
example we looked at so far it was basically looking at a single variable and how it varies across hundreds of
thousands of individuals but that's one single variable as opposed to an
increasing trend to have really truly much richer you know big data so
instead of measuring a single variable across individuals what you can do is actually measure hundreds of variables
or perhaps the entire clinical record for those individuals so that's what PheWAS
is all about it's basically the realization that you know there's a lot more information out there so
therefore we can actually use it to you know good purpose and what can you do
with that you can basically sort of discover new associations with multiple traits you can recognize similarities
and differences between different phenotypes and also infer networks of phenotypic variables and genetic
variables jointly so you can you can use
it for many different types of applications you can basically associate either unknown or underappreciated
phenotypes with known ones you can basically say well hey the activity of the frontal cortex when stimulated with
a learning task is in fact correlated with schizophrenia so that particular
trait by itself might be underappreciated but when you realize that it's in fact a very good predictor
of schizophrenia you can sort of use the two together you can identify disease mechanisms based on the intermediate
phenotypes that mediate them so this can be you know from the genetics to
the epigenomic level the transcription level the you know imaging level as I
mentioned earlier for the brain example and all kinds of other
endophenotypes that sort of can give you a hint as to how this is leading
ultimately to the more complex phenotype and we're going to see cases where by relating different types
of phenotypes to each other we can infer causality between them you can predict disease with easy to measure
biomarker phenotypes for example the level of lipids in your blood might be
highly indicative of I don't know type 1 diabetes or your obesity predispositions
and so forth and these biomarkers can sort of have a lot more information than
genetic data which usually has very weak effects whereas these biomarkers may actually have much
stronger correlations they may be much more predictive even if they're not causal they can be helpful
biomarkers you can also use combinations of these phenotypes to impute missing
data for many many more individuals so basically in your electronic health
record you may have the variable of interest for only a small number of people but you may have highly
correlated variables and therefore you can use these highly correlated variables to impute the missing data
across different phenotypic variables and then you know improve your GWAS power
by including the missing phenotypes across the entire cohort even if you've
only measured that particular lipid you know level in the blood in you know I
don't know a thousand individuals you may be able to impute it across 100,000
individuals and therefore boost your power greatly you may be able to improve the
biological relevance of genome-wide association studies by actually carrying out an association not just with
each phenotype individually but with meta-phenotypes that might be more indicative
of the true underlying biology because they combine multiple phenotypic
variables in a single umbrella meta phenotype everybody with me so far
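
The imputation idea above (a lab value measured in about a thousand people, filled in for the rest of the cohort from correlated phenotypes) can be sketched as follows. This is only an illustrative toy, not the method used in the lecture: the cohort sizes, the three "routine" phenotypes, their weights, and the choice of ordinary least squares are all assumptions, and the measured subset is picked at random, which real clinical data generally does not satisfy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_measured = 20_000, 1_000   # hypothetical cohort sizes

# Three routinely recorded phenotypes (hypothetical), available for everyone.
routine = rng.normal(size=(n_total, 3))
# The lab value of interest is correlated with them, plus independent noise.
true_weights = np.array([0.6, 0.3, -0.4])
lab_value = routine @ true_weights + rng.normal(scale=0.3, size=n_total)

# Only the first n_measured individuals actually had the lab test.
measured = np.arange(n_total) < n_measured

# Fit ordinary least squares (with intercept) on the measured subset.
design = np.column_stack([np.ones(n_total), routine])
coef, *_ = np.linalg.lstsq(design[measured], lab_value[measured], rcond=None)

# Impute the lab value for everyone who was never measured.
imputed = design[~measured] @ coef
imputation_r = np.corrcoef(imputed, lab_value[~measured])[0, 1]
print(f"correlation of imputed vs. true values: {imputation_r:.2f}")
```

The imputed values could then stand in for the phenotype in an association test across the full cohort, with the caveat that imputation error weakens the signal compared to a directly measured value.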
awesome you could also use it to understand pleiotropic effects so pleiotropic is a
Greek word sorry and pleio basically means many like the Pleiades the
Pleiades or whatever you call them which basically have many many stars together so pleio means
many and then tropic tropos is the way so many ways so pleiotropic effects are
basically effects where a particular genetic variant can act in multiple ways or a particular gene can act in multiple ways
so this is a term that's usually used for you know the same genetic locus or
the same genetic variant sometimes or sometimes the same gene through different genetic effects may actually
affect multiple phenotypes so for example the same gene might be involved in controlling your metabolism inside
your adipose cells and also your food selection in your brain and they're you
know seemingly unrelated pathways but they can converge later on or simply have the same you know locus having many
different effects in otherwise unrelated phenotypes you could
also guide both experiments as well as diagnosis by predicting missing
phenotypes you could basically say well hey I can predict your lipid levels will
be unusual so why don't you go ahead and take that test and so on and so forth so you can basically say hey this patient
seems to have type 2 diabetes let's measure their glucose level in the blood and then see if we can confirm that
prediction you know which can be made based on any other phenotype and you can
also start moving towards personalized medicine by basically focusing on
combinations of symptoms rather than individual phenotypic variables
everybody with me so far so that's sort of you know strong motivation for studying multiple phenotypes together
and if you look at electronic health records they try to achieve exactly that so basically electronic health records
contain very rich information they contain you know for example the drugs that a person is taking the lab tests such as
blood lipids for example that the person may have taken they can contain imaging information from you know a brain
scan or you know an x-ray and they also contain free text information from the
doctor taking notes and all of that has been put into various databases over the
last few years with increasing digitization of these data sets which can be used at
the point of care which can be used to sort of infer statistics across different types of variables and which can also be
used to carry out research so this information includes clinical notes so
basically these are unstructured freeform text it contains lab tests basically these are you know
observations and identifier names and you know billing codes these ICD-9
codes International Classification of Diseases ICD-9 ICD-10 are names that you hear very often which are sort of
standardizing the nomenclature so that this can be interoperable across different systems and different
hospitals and another classification is the DRG classification which is the
diagnosis related groups also there's current procedural terminology or CPT so
all of these are different billing codes they were initially used to sort of help insurance companies pay for different
things but more and more they're recognized to actually be very phenotypically informative of the
patients themselves but of course they have a lot of limitations which can be partly overcome by treating the data together
and then these can also include pharmaceutical prescriptions for example RxNorm can be you know one such type of
database for prescription drugs everybody with me so far so if you look at the last few years
about a decade ago there was only 10% adoption of electronic health records
and we're now at 90% I mean this is a dramatic transformation of the healthcare system in just a very very
short time and these are sort of the places where there were just simply no you know no clinical records going to
having them you know at nearly 100 percent frequency in all of these places which
is again a remarkable transformation moreover in addition to being more
prevalent they're increasingly comprehensive so you know EHR data can include many different types of
variables and more and more they include richer information so initially they
would include comprehensive data for you know a much smaller proportion of the individuals that even had EHRs now it's
you know roughly forty percent of the ones that have EHRs have comprehensive EHRs and that's something that has
continued with more and more hospitals having rich information as well as that
information so the electronic health record data
itself can be thought of in a Bayesian framework as being a reflection of
the true state of a patient we never observe the true state we only measure various things which can tell us about
the true underlying state so basically the true patient state is something that's hidden from us but you know
through some recording process we can have some raw electronic health record data and that's a reflection of that but
what's really interesting is that the true state also dictates the type
of health care that the person will undergo so basically you know whether a person will be treated or not we'd like
to think is actually dependent on the true state of
that patient if the doctors are able to do an accurate diagnosis and that can in turn inform the types of phenotypes that
are gathered for those individuals so if we believe somebody has type 2 diabetes
based on another indication we will go ahead and measure their glucose levels
in the blood for example as a way to validate our prediction or to test whether you know they have it or not but
that might not be something that we test routinely for individuals who are not at risk and that's sort of a place where that
true state can influence both the values of a measurement as well as whether a
measurement is made at all everybody with me here so that basically means
that the missing data is actually not missing at random it is missing in a
very particular process which is actually not at random and the missing data is more likely to be normal than
the observed data which is more likely to be abnormal so if I only measure
glucose levels for individuals that I expect to be at risk the distribution that I'm gonna get for the observed data
it's gonna be very skewed towards abnormal levels everybody with me on that great so then basically you know
using all that I can then you know develop all kinds of knowledge tasks like stratification of
patients into groups prediction of mortality or prediction of whether somebody will get sick and you know
understanding of the underlying biology and also guiding interventions for those
everybody with me so far so basically this is a lot of motivations for why we
want to look at multiple phenotypic variables at once so now let's look at
sort of different ways of doing so basically the first is let's use genetic
information to model these pleiotropic effects here's something that we've you
know touched upon before in terms of the hugely complex architecture of some of
these psychiatric disorders so across you know 3,000 cases and controls there's
the ability to actually you know predict a score associated with psychiatric
disorders and then the score can be associated with additional disorders
beyond schizophrenia for example if you predict that score for one trait you can
then sort of turn around and say well is this predictive of another trait and what you find is that for example genetic variants associated with
schizophrenia are also associated with bipolar disorder and you know much more
rarely associated with non psychiatric disorders suggesting some sort of common architecture and common genetic basis
for both in fact that's what led to a push for increasing sample sizes when
nothing was coming out of the genome-wide association studies people were basically saying listen we know there's genetic signal there let's just push harder and
we will try to find it so doing this basically resulted in many many
different places in the genome associated with disease but some places
are then starting to be multiply associated with disease basically if you look at a mapping of genome-wide
association study hits across the genome you basically start seeing some loci that are associated with many many
traits and that was the first motivation for PheWAS rather than just GWAS so
what is PheWAS doing basically GWAS is asking for every genetic variant
what is its association with a particular trait what PheWAS is doing is instead of putting different genetic
variants on the x-axis and a particular phenotype for each plot what we're gonna
do instead is ask for a particular genetic variant what are all of the
traits that it is associated with so it's actually flipping GWAS around and it's basically saying what is the phenome-wide
association between a bunch of different phenotypes and a SNP instead
of asking for the association between a bunch of SNPs and a phenotype it is asking what is the association between a
bunch of phenotypes and a SNP okay raise your hands if you're with me so
far awesome great so that's what PheWAS is doing it's basically flipping things around and you can look for you know
this particular SNP which is sitting in this locus what are its associations and you can see
that suddenly there's a lot of different traits that are lighting up there's basically both neoplastic traits such
as you know carcinomas neoplasms non-melanoma skin cancer and so on and so forth as well as dermatological
traits such as you know keratosis or dermatosis or solar dermatosis
and so on and so forth so that is basically saying hey wait a minute through these common genetic variants I
can associate you know skin disorders
with melanoma disorders and of course biologically this makes a lot of sense but it's nice that it's sort of coming
out of the data directly and this can provide a systematic approach to basically look for trait-trait relationships
through the genetic variants that they share here's an example of leukoplakia and then
keratosis here's another example where you can see sort of many more cardiovascular traits as well as
metabolic traits which are very
closely related to each other so myocardial infarction coronary atherosclerosis as well as type 2
diabetes for example are you know all associated with the same locus so suddenly
you are linking together phenotypic traits that would otherwise be you know separate and studied in isolation and
here's another example from the HLA locus this is you know where the major histocompatibility complex
sits and where a lot of these variants affect the immune system
functioning of different individuals and you can see here that it has a very large number of associations with type 1
diabetes with arthritis with you know all kinds of other traits ok any
questions so far you can do the same across these billing
codes across these electronic health record billing codes so you know you can basically look at all of these codes and
then ask you know how are different traits associated with each other and you can see here for example multiple
sclerosis is associated with all kinds of additional traits so basically the way
to think about PheWAS is you know to kill many birds with one stone basically you have you know a single stone which is a SNP
and you can look at all of the different birds the you know phenotypes that are associated and you can use that from the
very discovery phase you can basically say well if I have multiple phenotypes to start with instead of thinking of
a single phenotypic variable a single vector of the phenotype for a particular person let me think of a matrix instead
of n individuals and instead of having only one trait having d traits at the
same time and then you know you can correlate that with the genetic information for those individuals in
the genotype matrix X and then the effect sizes of each of these variants
on each of these traits so basically this becomes a matrix operation or
transforming a matrix to a matrix rather than sort of individual vectors for each
of those and then of course at the same time you have this random effect matrix for the n individuals
which can be sort of capturing the genetic variation and then this environmental component and so you can basically model the
multiple traits using these multivariate linear mixed models instead of univariate linear mixed models and
when Matthew Stephens and his group did that in Nature Methods in 2014 they basically
showed that depending on the number of phenotypes that are affected and the
number of traits that are associated with them you can basically get a huge boost in power by basically doing multi-trait
modeling compared to you know a single-trait model basically if you model four traits at a time versus only
two traits at a time you can see these you know dramatic increases in power for
every SNP basically using this multi-phenotype model everybody with me so far
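
The matrix view described above (an n by d phenotype matrix regressed on genotype, with one effect size per variant and trait) can be sketched for a single SNP as follows. This is a deliberately simplified toy and not the multivariate linear mixed model from the lecture: it assumes unrelated individuals, so the random effect term is dropped and plain least squares is used. The sample size, number of traits, allele frequency, and effect sizes are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5_000, 6                      # individuals, traits (hypothetical sizes)

# Genotype of one SNP coded as 0/1/2 copies of the minor allele.
snp = rng.binomial(2, 0.3, size=n).astype(float)

# True effects: the SNP is pleiotropic for traits 0 and 1 only.
true_beta = np.array([0.2, -0.2, 0.0, 0.0, 0.0, 0.0])
Y = np.outer(snp, true_beta) + rng.normal(size=(n, d))   # n x d phenotype matrix

# Design matrix with intercept; one least-squares solve covers all d traits.
X = np.column_stack([np.ones(n), snp])
B, *_ = np.linalg.lstsq(X, Y, rcond=None)                # 2 x d coefficients
residuals = Y - X @ B
sigma2 = (residuals ** 2).sum(axis=0) / (n - X.shape[1]) # per-trait noise variance
snp_var = np.linalg.inv(X.T @ X)[1, 1]                   # variance factor for the SNP term
z_scores = B[1] / np.sqrt(sigma2 * snp_var)

for trait, (beta, z) in enumerate(zip(B[1], z_scores)):
    print(f"trait {trait}: beta={beta:+.3f}  z={z:+.1f}")
```

Reading the z-scores across traits for one SNP is the "flipped" PheWAS view; a real analysis would additionally model relatedness and trait covariance, which is where the mixed model comes in.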
so there are many different ways of doing this you can basically study the relationship between different traits by
basically asking hey are they co-associated with the same SNP are they you know can I sort of boost the power
of individual GWAS discovery by using them jointly or you can use this LD
score regression that we talked about in our heritability lectures to basically calculate the shared genetic component
of multiple traits so in this particular case you can basically study pairs of
traits and then ask you know what is the number of individuals for example and then what is the LD score
which is basically summing up the correlation patterns with every SNP in that block
and then using that you can use the LD score for each SNP across pairs of
traits and then you can use this LD
score correlation in estimating the relationship between different traits so
for example if you look at you know if somebody has ever smoked you know you
can ask what is the co-association at a genetic level using LD score
regression with BMI or with childhood obesity or with fasting glucose level or
type 2 diabetes you can see here how each of these traits is basically showing these significant co-associations
in blue here you can see groups of traits that are very highly correlated
for example height and then the head circumference of an infant or the length at birth or the weight at birth
are in fact all very very highly correlated with each other and this is basically by asking what is the shared
genetic component of these traits across
the genome even if you know any one of them might not be genome-wide significant you can still study the correlation and
find Crohn's disease and ulcerative colitis for example which are you know both immune traits of the intestine
HDL cholesterol and the age of menarche when periods start you can look at
you know BMI and childhood obesity or fasting glucose and type 2 diabetes you
know depression bipolar and schizophrenia you see this huge sharing going on everybody with me so far so
you can also do this at the region level you can basically go region to region and
then compute what is the Bayes factor how much more likely am I to be associated
with the disease compared to you know by chance or different models and then
calculate the similarity between traits based on you know distinguishing models
of sharing of effect and again this leads to very similar results of sort of groups of traits in
this particular case a lot of different immunological traits and anthropological traits sort of anthropometric
related features and what Joe Pickrell realized is that there's a very
cool thing that you can do with that you can basically say if I take all of the variants that are truly associated with
BMI do I then see whether they have an effect on triglycerides and I can do the
same basically asking you know if I take all triglyceride variants you know do they have an effect on BMI and what he found
is a very interesting asymmetry between pairs of traits basically in
one direction conditioning basically ascertaining specifically for BMI led to
a correlation with the effect size on triglycerides whereas in the opposite
direction if I only look at triglyceride associated variants there was no correlation between their effect size
on triglycerides and effect size on BMI and that basically led him to postulate that in fact that was evidence of
directionality of causality that basically if BMI variants truly have an
effect on triglycerides then there will be a strong correlation of the effect sizes when I focus on BMI variants but
no correlation when I focus on triglycerides and you know similarly for LDL cholesterol and coronary artery
disease you basically see that you know focusing on one leads to a correlation focusing on the other loses that correlation so that
allows us to now start doing causal inference from the you know
unidirectionality in these conditional associations
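
The asymmetry argument above can be illustrated with a toy simulation at the level of effect sizes. This is a hedged sketch, not the published analysis: it assumes a world where BMI causally raises triglycerides (TG), and every number in it (the causal effect of 0.15, the noise level, the ascertainment threshold, the count of variants) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_each, causal_effect, noise_sd, threshold = 1_000, 0.15, 0.1, 0.5

# Variants acting on BMI also reach triglycerides (TG) through BMI.
bmi_variants_on_bmi = rng.normal(size=n_each)
bmi_variants_on_tg = causal_effect * bmi_variants_on_bmi
# Variants acting directly on TG have no effect on BMI in this toy model.
tg_variants_on_tg = rng.normal(size=n_each)
tg_variants_on_bmi = np.zeros(n_each)

# Estimated effects = true effects + estimation noise.
beta_bmi = np.concatenate([bmi_variants_on_bmi, tg_variants_on_bmi])
beta_tg = np.concatenate([bmi_variants_on_tg, tg_variants_on_tg])
beta_bmi_hat = beta_bmi + rng.normal(scale=noise_sd, size=2 * n_each)
beta_tg_hat = beta_tg + rng.normal(scale=noise_sd, size=2 * n_each)

def effect_correlation(ascertained):
    """Correlation of BMI and TG effect estimates among the ascertained variants."""
    return np.corrcoef(beta_bmi_hat[ascertained], beta_tg_hat[ascertained])[0, 1]

# Ascertain on one trait at a time, then compare effect sizes on both traits.
r_given_bmi = effect_correlation(np.abs(beta_bmi_hat) > threshold)
r_given_tg = effect_correlation(np.abs(beta_tg_hat) > threshold)
print(f"variants ascertained for BMI: r = {r_given_bmi:+.2f}")
print(f"variants ascertained for TG:  r = {r_given_tg:+.2f}")
```

In this toy setting the correlation is strong when variants are ascertained for BMI and close to zero when they are ascertained for TG, which is the pattern the lecture reads as evidence for the direction of causality.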
everybody with me so far here are some
additional examples so BMI has an effect on type 2 diabetes but not the other way
around so basically again BMI is a huge risk factor body mass index is a huge risk factor for type 2 diabetes but type 2
diabetes does not lead to obesity and conversely having you know abnormal
thyroid activity basically is associated
inversely with height but not the other way around so you can
actually use that to start inferring pleiotropic variants by basically looking at every combination of settings
between pairs of SNPs and their alleles and then asking what is the you
know corresponding effect size on each of the different phenotypes by enumeration
the other approach is to basically start asking about partitioning of the genome
into its phenotypic associations so the concept here is that we can
now start doing genetics at the systems level basically looking at the whole genome each time and many many different
phenotypic associations so what you know systems genetics tries to do is basically characterize multiple
polygenic associations for complex disorders basically define what is the
normal range define the correlation between genetic variation and traits and this was quite a pioneering paper by
Joe Nadeau proposing more than seven years ago to basically start understanding comorbidity patterns or association
patterns of different traits more recently there's you know several people
who are still wondering hey what has come of that field where is systems genetics today there's still a lot of
debate about that so basically you know sure we can characterize systems-level
genetic outcomes not just one gene at a time and we can integrate data across
many different modalities but we're still missing a lot of sort of the inference of networks
and correlations but with larger and larger biobanks that
have genotyped individuals each profiled for thousands of traits we can actually
start going a step beyond that and truly deliver on the promise of genetics so in the UK Biobank for
example we can look at you know hundreds of different traits and in this
particular case we're selecting 47 high robustness ones



the last part that I want to talk about today is the concept of inferring these
meta phenotypes and then imputing individual phenotypic variables using
those meta phenotypes so in this particular case we're not gonna focus on genetics we're
only gonna focus on the phenotypes that come straight from the clinic
and the reason for that is there are truly millions of individuals in
different hospital systems such as you know the Mayo Clinic Partners HealthCare
the Geisinger system you know
Kaiser Permanente all of these are in fact containing millions of individuals but only a small small fraction of those
have been genotyped so basically we're looking at you know 10,000 50,000
100,000 for some of the largest cohorts but still nowhere near the millions so the genotypes are often not available
for the large patient cohorts but given the causal mediating phenotypes diseases
of interest are conditionally independent of genotype so electronic
health records basically allow deep phenotyping to basically start inferring what are the true underlying biological
roots so the idea here is that if you had such an intelligent system you can
basically sort of you know train the model based on existing data and then use it to make clinical recommendations
for different systems you could recommend lab tests you could recommend
specific diagnoses that could then be verified by the professionals or you could recommend prescriptions and
even treatment procedures if you could do that systematically so this is what
you know we've been trying to do so in this particular case you can see the comorbidity patterns of different
phenotypes this is a plot generated by Jose Davila in collaboration with Yue Li and for each of those modules of
comorbidity patterns you can basically see what are the biological terms
associated with the corresponding ICD-9 codes and you can see here sort of you know in
a color-coded fashion how these are clustering with each other in terms of
their comorbidity patterns so then the intuition here is that we're going to be factorizing that matrix so
basically across individuals and across phenotypes we're going to be basically looking for a lower dimensional
representation of these associations which can then allow us to infer these
latent factors and these can correspond to both patient to patient similarities
or clusters of patients that share similar phenotypic variables as well
as clustering of phenotypes together when they have association with similar sets
of patients so then the idea here is that we can sort of move in any one of these
dimensions of these multi-dimensional objects on one hand you can basically look at the similarity of different EHR
features and then the underlying latent disease topic so these are the meta
phenotypes which we can learn from multiple combinations of records and
also the patient dimensionality here so basically by looking at the correlation
between different EHR features across patients we can infer latent disease topics and similarly by looking at the
correlation between patients across EHR features we can infer patient groups
that are you know more robust than the information for any one patient
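
The factorization intuition above (a patient by phenotype matrix reduced to a few latent factors that group both patients and codes) can be sketched with a truncated SVD. This is a stand-in chosen for brevity, not the topic model discussed in the lecture, and the matrix is simulated: the number of patients and codes, the two hidden groups, and the code probabilities are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_patients, n_codes = 200, 40          # hypothetical sizes

# Two hidden patient groups; each group tends to carry its own 20 codes.
group = rng.integers(0, 2, size=n_patients)
prob = np.full((n_patients, n_codes), 0.02)
prob[group == 0, :20] = 0.4
prob[group == 1, 20:] = 0.4
ehr = (rng.random((n_patients, n_codes)) < prob).astype(float)  # binary patient x code matrix

# Rank-2 truncated SVD of the column-centered matrix.
centered = ehr - ehr.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
patient_factors = U[:, :2] * s[:2]     # patients in the latent space
code_loadings = Vt[:2]                 # latent factors expressed over codes

# The sign of the first factor splits patients into two groups;
# the labeling is arbitrary, so score agreement up to a label swap.
predicted = (patient_factors[:, 0] > 0).astype(int)
agreement = max(np.mean(predicted == group), np.mean(predicted != group))
print(f"agreement with hidden groups: {agreement:.2f}")
```

The rows of `code_loadings` play the role of meta phenotypes here, and `patient_factors` plays the role of patient groupings; real EHR matrices are far sparser and noisier than this simulation.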
so a very common approach in this particular methodology is to actually
group words by their topic so bags of words kind of approaches basically say
you know all these different words are associated with the arts all these different words are associated with a
budget all these different words are associated with children or with education and by basically simply
counting how frequently each of these words appears in a particular document you can then classify documents
according to their topic everybody with me here so basically what we're gonna do
is a very similar approach where we're now going to learn healthcare topics
associated with different words and then infer the disease topic of each person
based on the set of words that their clinical record contains
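
The bag-of-words counting described above can be sketched in a few lines. The keyword lists and the example record below are hypothetical and hand-written purely for illustration; the approach in the lecture learns the word-to-topic associations from data rather than fixing them in advance.

```python
from collections import Counter

# Hypothetical keyword lists; a real topic model would learn these from data.
TOPIC_WORDS = {
    "cardiovascular": {"hypertension", "chf", "atherosclerosis", "statin"},
    "metabolic": {"glucose", "insulin", "diabetes", "metformin"},
}

def topic_counts(record_text):
    """Count how many words of the record fall under each topic's keyword list."""
    words = Counter(record_text.lower().split())
    return {topic: sum(words[w] for w in vocab) for topic, vocab in TOPIC_WORDS.items()}

def assign_topic(record_text):
    """Assign the record to the topic with the highest keyword count."""
    counts = topic_counts(record_text)
    return max(counts, key=counts.get)

record = "Hypertension CHF statin glucose hypertension"
print(topic_counts(record))
print(assign_topic(record))
```

Word order is ignored entirely, which is what makes this a bag-of-words approach.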
challenges the first challenge is that electronic health record data are extremely noisy and they're also
extremely biased so unlike corpus documents in text mining clinical notes
are full of typos and full of arbitrary abbreviations so here you know you can
see hypertension and then coronary atherosclerosis native CHF
you know NB obsrv suspct infect I
mean these are just you know extremely basically arbitrary abbreviations of you
know different notes if you look at the clinical notes themselves you can see continued you know Hospit video and so
forth so basically in text mining it's usual to remove stop words here
there's no clearly defined stop word and then the billing codes are not meant to be disease specific they're just used
for billing so how do you deal with all that so one way to do it
sorry and the second challenge is that the EHR data are extremely sparse if you basically ask what fraction of patients
has which phenotypes you see that the vast majority of phenotypes are found
each in less than 5% of the patients we have basically 30,000
phenotypes that are each associated with you know less than 5% of the patients and then you have some
phenotypes that are associated with a huge huge number of patients and same for icd-9 codes labs and so forth so
that basically means that somehow we have to deal with this extreme extreme sparsity where most of the data is
simply never ascertained and most of these variables are simply empty in many
of the individuals everybody with me for challenge number two and then number three is that missing data is
extremely extremely biased so basically if you look at the fraction of
individuals for which the you know particular variable was measured for
example whether the lab result was taken you see that you know so these
are different lab tests 192 lab tests each column is one lab test and then
you're asking what fraction of people showed a normal score versus what
fraction of people were in the top 1% or the bottom 1% you can see here that 20%
of the people are in the top 1% and another 20% of the people are in the bottom 1% how is that possible it's
possible because we are pretty good at you know predicting whether something is normal or not and therefore when a test
is prescribed most of the time we expect to see an abnormal result
and therefore when I see that you know this 20% of the patients had in fact an
abnormally high or an abnormally low test that basically means that I cannot
use the observed data as a way to build the true underlying distribution of
different variables because then I would be greatly amplifying the extremes of
the distribution everybody with me here so
you know what do we need to do basically what we need is you know just different ways of specifically dealing with the
not missing at random problem so not missing at random is something that has
been you know well recognized if you look at for example the users that
give music ratings if you look at those
users you basically see this if you ask for randomly assigned songs what would you know the score be
what you will find is that very few people give the best score in fact you
know if you randomly assign music to people you will see that most of the time they're not gonna like what they
listen to and then some of the time they're gonna like it a lot but if you look now at the music that
people actually listen to they self select songs that they expect to like and
then they end up with a much higher proportion of 5 stars and 4 stars and 3
stars than what you would expect if you were assigning these at random everybody
with me so far so that basically means that if the data is missing that usually means that
they're not gonna like that song because they haven't even bothered playing that
song does that make sense so in the same way with clinical data if
something is missing it's very likely to be you know normal whereas if something is
observed it's much more likely to be abnormal and in this particular case the observed data is very frequently
abnormal sort of you know if I really like the song I'm more likely to observe it so
there's been a lot of work in sort of explicitly modeling the pattern of
missingness of basically studying the driver mechanism underlying this
missingness pattern so the concept here is that you can actually distinguish
different missing mechanisms so you can basically have a mechanism of missing
completely at random whereby the you
know the data is not the missing data is neither driven by the observed data nor
driven by the missing data they're simply missing at random which basically says that the missing indicator only
depends on the observed results but not on the missing results themselves and then there's also non missing at random
where you know we have to actually explicitly model both the missing and
the non missing data so that basically means that the value of the missing data
is in fact influencing whether the data is missing or not so as we saw before
the tests that are missing are more likely to be normal and they set the
tests that are observed or more likely to be abnormal everybody with me so far
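The three mechanisms can be written compactly. The notation below is the standard textbook one and is an addition here, not something shown in the lecture: M is the missing indicator, X_obs the observed values, and X_mis the values that are missing.

```latex
\begin{align*}
\text{Missing completely at random:}\quad
  & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M) \\
\text{Missing at random:}\quad
  & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) = P(M \mid X_{\mathrm{obs}}) \\
\text{Not missing at random:}\quad
  & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}) \text{ still depends on } X_{\mathrm{mis}}
\end{align*}
```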
So if you look at these model parameters, you can
explicitly create a variable associated with missingness. You can basically say:
here's an indicator variable, it will tell me whether the data is missing or not. And then I can have a partially
observed result, and knowing what cluster the patient belongs in tells me
both about the true underlying value and the probability that I would even
observe that variable. And that's the whole concept: by first grouping
the patients into clusters, we can then learn which clusters are associated with
abnormal scores and then do a cluster-conditional estimation of the values for
specific lab tests. If the lab test is observed, that will allow me to
cluster the patients based on all of the phenotypic information for those patients but also the lab
test itself. So I can start clustering the patients, and then within that cluster I can say, well, the
probability with which the variable is missing is in fact highly dependent on
the cluster membership. Raise your hands if you're with me. Awesome. So then the key
idea is to code the missing indicator as part of your model. You could have the missing indicator as part
of your data itself, or you could simply have it as a latent variable within the model. And here you can see that
when you look at overall prediction of medical prescriptions, for example, as a
function of how many variables you include missingness information for, you
can see that after five variables or so there's continued improvement in the overall prediction
based on the area under the precision-recall curve, which basically suggests that
modeling the missingness pattern is indeed extremely important. Okay, then I can
jointly model both the lab test assignment, whether I was prescribed a
particular lab test regardless of its value, and what the actual result of that
lab test was. For example, if I belong in a set of patients with respiratory problems,
I'm much more likely to be asked to have a complete blood cell count, and that
complete blood cell count is much more likely to be abnormal, and the same for C-reactive protein. Similarly, if I have
kidney problems, I'm much more likely to be prescribed a urea nitrogen test than if I don't.
So you can start modeling this as having the patient cluster
dictate both the lab result and the lab presence itself, and having lab
presence dependent on the lab result. Of course I don't observe that directly, but knowing the probability of the
patient belonging to that particular cluster, I can flow information through. So that's an example of
explicitly modeling the missingness mechanism, which basically asks: was this lab observed or not? Sounds good? So then you
know, we can do this across many different phenotypic variables. Basically,
using both lab presence and lab results, I can learn hyperparameters
that dictate the probability that a particular lab variable was observed or not, and the value of
that variable. So then we can jointly model these missing indicators and the
lab test results, whereby the observation for a particular patient depends on the
score with which that particular patient belongs in a particular topic, the
likelihood of the result of that lab test, and the
missing-indicator likelihood for that particular lab. That basically says
that the numbers of patients assigned to a particular topic when lab L is observed
versus when L is not observed are in fact highly dependent on that class.
You can do that both for lab tests and for all of the other phenotypes, looking at various
binary clinical features, all kinds of ICD-9 codes, et cetera.
You can start modeling, number one, the patient meta-phenotype, the class or
disease topic that the patient belongs in, in other words their phenotypic class, their meta-phenotype, and, conditioned on
that class, both the frequency of the different phenotypes and whether they
are taking a particular test and what the results of that test will be. Sounds
good? Basically I can use this relatively simpler framework for all of the phenotypes, and then this relatively more
complex framework for the labs, because of the prescription bias for the lab itself.
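A minimal sketch of the cluster-conditional idea, under strong simplifying assumptions: two invented clusters, one lab test, and test presence and test result linked only through the cluster. The probabilities are made up, and this is not the model from the slides, just Bayes' rule applied to the missing indicator.

```python
# Hypothetical two-cluster example; every probability below is invented.
CLUSTERS = {
    "respiratory": {"prior": 0.3, "p_ordered": 0.8, "p_abnormal": 0.6},
    "other": {"prior": 0.7, "p_ordered": 0.1, "p_abnormal": 0.1},
}

def cluster_posterior(ordered):
    """P(cluster | missing indicator) for one lab test, by Bayes' rule."""
    unnormalized = {}
    for name, c in CLUSTERS.items():
        likelihood = c["p_ordered"] if ordered else 1.0 - c["p_ordered"]
        unnormalized[name] = c["prior"] * likelihood
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}

def p_abnormal(ordered):
    """Cluster-conditional estimate of P(result is abnormal | indicator)."""
    posterior = cluster_posterior(ordered)
    return sum(posterior[name] * CLUSTERS[name]["p_abnormal"] for name in CLUSTERS)

p_if_ordered = p_abnormal(ordered=True)
p_if_missing = p_abnormal(ordered=False)
print(f"P(abnormal | test ordered):     {p_if_ordered:.3f}")
print(f"P(abnormal | test not ordered): {p_if_missing:.3f}")
```

Even in this toy setting the fact that a test was ordered shifts the patient toward the cluster where abnormal results are common, so an unordered test is estimated to be abnormal far less often than an ordered one.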
So then what Yue Li did in the group is he developed this model called MixEHR,
a mixture model for electronic health records, that now allows you to do
exactly what was in the previous slide: model this meta-phenotype
and then treat the missing indicator for each of the variables, as
well as all of the additional phenotypes. What he does is
decompose the matrix of patients by lab tests, by ICD-9 billing codes, by
questionnaires, prescriptions, treatments, and all of the clinical notes. For
each of these different data modalities he has a particular factorization, and then a loading matrix that allows you to
combine these different modalities together. So then he can predict the risk of each patient belonging in each class of
meta-phenotypes and then infer
the disease class for each person. Cutting across this
you can see DRG codes, ICD-9 codes, lab prescriptions,
clinical notes, and medical prescriptions, and this hierarchical
model allows you to infer the meta-phenotype at the top level, and then, within each of the data modalities, the
topic-specific meta-phenotype, again partitioning this phenotypic matrix into
each component. So what you can do is then start asking
in simulation whether the model is in fact capturing the simulated data.
What we can see is that the not-missing-at-random component is in fact greatly
outperforming the missing-at-random model. If we assume that the
missing data is randomly missing, then we get much worse performance, whereas if we
assume that the data is not missing at random we get better performance, and it agrees
with the true model much, much more frequently, as you would expect.
Here are some terms, for example, that are very heavily associated with different
meta-phenotypes. Here you're learning the different meta-phenotypes, and then what you can ask is what are
the individual disease phenotypes associated with each. What you can see is that for every one of these
meta-phenotypes there's a large number of different data modalities contributing to it. That basically suggests that I
can combine information from clinical notes, from labs, from
prescriptions, from billing codes, from all of these
different data modalities, shown here in different colors. You can also
ask how these topics are correlated with quantitative variables such as the age
of the individual. We can ask which topics are the most positively correlated with age and
which topics are the most negatively correlated with age. What we
can see is that, for example, heart failure, cardiovascular disease, and dementia are three of the most
positively correlated with age, which are diseases of old age. And if you look
at the most negative correlation with age, these are neonate and preterm birth and so
forth, which are in the earliest age group. What's
really interesting is that preterm is in fact even more negatively associated than,
say, neonate, which is again fully consistent with the biology. You can also
ask, for different classes of traits, for example for schizophrenia-associated traits or PTSD-associated traits, what
are the top-scoring terms. And here you
can see for example from the clinical notes delusion in schizophrenia and bipolar and you know coach anton and
psych and psych spelled and psych code and you know etc are basically featuring
in the clinical notes of the doctors that are dealing with patients in this cluster. The ICD-9
billing codes seem to agree with that, and in fact with schizophrenia you also see poison
tranquilizer for example or toxic effect caustic no other detail it so effective
disorder and so on and so forth. So you can see here how, across different data modalities, we're combining terms that
are all strongly indicative of schizophrenia or of post-traumatic stress disorder.
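Pulling the top-scoring terms for one topic across modalities amounts to a simple ranking. The sketch below assumes each topic comes with one weight table per data modality; the weights are invented, and the table layout is an assumption rather than the actual output format of any tool.

```python
# Hypothetical topic-specific term weights, one dictionary per data modality.
TOPIC_WEIGHTS = {
    "clinical notes": {"delusion": 0.30, "schizophrenia": 0.25,
                       "bipolar": 0.20, "fever": 0.01},
    "ICD-9 codes": {"schizophrenia": 0.40, "affective disorder": 0.15,
                    "fracture": 0.02},
}

def top_terms(weights_by_modality, k=4):
    """Rank (modality, term, weight) triples for one topic, highest weight first."""
    triples = [
        (modality, term, weight)
        for modality, terms in weights_by_modality.items()
        for term, weight in terms.items()
    ]
    triples.sort(key=lambda item: item[2], reverse=True)
    return triples[:k]

top = top_terms(TOPIC_WEIGHTS)
for modality, term, weight in top:
    print(f"{weight:.2f}  {term}  ({modality})")
```

Keeping the modality next to each term is what lets you see agreement across sources, for example the same diagnosis surfacing in both the notes and the billing codes.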
Everybody with me so far? What you can also do is start prioritizing patients
based on the disease mixture. Here, for example, you can see that only a
subset of the individuals who are in this particular cluster were tagged
with leukemia, or with pulmonary embolism, or with liver cirrhosis. You
can see that there are grey squares here where that particular phenotype was not explicitly declared.
And if you look here you can see why these patients were in fact selected
to be part of that group: even though they did not have
leukemia explicitly mentioned, they had all these other phenotypes that
are highly indicative of leukemia. Here, even though these guys didn't have pulmonary embolism explicitly mentioned,
they have all these other phenotypes that are highly predictive of that. And then you can explicitly see what are the
terms, color-coded according to the evidence type, that are associated with
each of these clusters. In cluster 31, for example, you can see what
are the most informative terms in this word cloud, and similarly for topic 35 and for topic 15. You can see that
there are many different lines of evidence which are all combined together
to infer the membership of these
patients, and therefore to complete what would otherwise be missing information,
so that we can now start imputing missing EHR codes.
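One common way to check imputation of this kind is a hold-out test: hide a code that is known to be present, impute, and see whether it comes back. The sketch below uses a deliberately simple stand-in imputer (similarity-weighted voting among other patients), not the topic model discussed in the lecture, and the six toy records are invented.

```python
# Toy records; the codes are illustrative labels, not real EHR data.
PATIENTS = [
    {"leukemia", "anemia", "chemotherapy"},
    {"leukemia", "anemia", "chemotherapy", "neutropenia"},
    {"leukemia", "chemotherapy", "neutropenia"},
    {"cirrhosis", "ascites", "hepatitis"},
    {"cirrhosis", "ascites", "jaundice"},
    {"cirrhosis", "hepatitis", "jaundice", "ascites"},
]
VOCABULARY = set().union(*PATIENTS)

def jaccard(a, b):
    """Overlap between two code sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def impute(observed, others, k=2):
    """Return the k absent codes best supported by similar patients."""
    scores = {code: 0.0 for code in VOCABULARY - observed}
    for other in others:
        weight = jaccard(observed, other)
        for code in other - observed:
            scores[code] += weight
    ranked = sorted(scores, key=lambda code: (-scores[code], code))
    return ranked[:k]

def recovery_rate(patients, k=2):
    """Hide each recorded code in turn and count how often it is imputed back."""
    hits = trials = 0
    for i, codes in enumerate(patients):
        others = patients[:i] + patients[i + 1:]
        for hidden in codes:
            observed = codes - {hidden}
            hits += hidden in impute(observed, others, k)
            trials += 1
    return hits / trials

rate = recovery_rate(PATIENTS)
print(f"hidden codes recovered in the top 2: {rate:.0%}")
```

The toy data is far cleaner than a real clinical record, so the rate it prints says nothing about real performance; the point is only the shape of the evaluation loop.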
So we can go from something that's extremely sparse to actually
imputing the values for each of those, and you can see here that many of the white gaps are getting filled in,
suggesting that maybe the doctors should have added that additional term. You can also
hide some of these variables:
basically say, well, I know that the patient has that, but I'm gonna remove that code. Am I able to
recover it? And the answer is overwhelmingly yes. 82 percent of the time, if you remove a term,
we're able to predict it back. And in fact this is heavily biased against us, because usually when you add
one particular term, that might be the only one you bother marking, and you may not mark the
others, so removing that one term in fact leads to decreased membership in
that particular class. The other thing you can do is predict future mortality: based on the
visit today, will the patient die the next time they come to the hospital?
That can be actually very helpful, and the answer is again overwhelmingly
yes. You have 85% AUC for predicting whether a patient
will die in hospital. So if we can do this reliably, we probably want to
let people know, warn them in advance: don't wait until your next visit, maybe you should be treated.
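For readers unfamiliar with the metric, here is a small sketch of how an AUC can be computed, assuming AUC here means the area under the ROC curve (the lecture does not say which curve). The risk scores and outcomes are invented.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs where the positive case gets the higher
    score, with ties counted as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted mortality risk per patient, and what happened.
risk = [0.90, 0.80, 0.35, 0.70, 0.30, 0.20, 0.10, 0.40]
died = [1, 1, 1, 0, 0, 0, 0, 0]

mortality_auc = auc(risk, died)
print(f"AUC: {mortality_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of patients who died above those who did not.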
And you can also ask what are the terms that are the most predictive of mortality. You can see that
if you have ventricular fibrillation, for example, or cardiac arrest at the last visit, or
anoxic brain damage and so forth, you're in pretty bad
shape for making it through your next visit. The same goes for renal failure, acute necrosis of liver and so forth, or
if you're on a mechanical ventilator, that's a good indication that next
time around you won't be able to make it, and so on and so forth. What
this is telling us is that we can make not just a
classification of the current state of the patient but also a prediction of the
future state of the patient. So just to summarize this part: these phenome-wide association studies can
elucidate phenotypic networks by linking genetics to their causal
traits. You can model multiple traits jointly and reveal the genetic correlations
between them. And then you can start modeling EHR data, recognizing that it's extremely rich but there are many
challenges in modeling it because of the high dimensionality, high sparsity, and not-missing-at-random data. These
machine learning methods are especially helpful, especially generative models, and they hold great promise to learn
compressed latent dimensions of these super-high-dimensional data sets and also to impute the missing data to
reveal the underlying structure. What we covered today: we basically motivated why we even care about these
multiple phenotypes studied jointly, and I introduced the concept of PheWAS, of
basically associating every phenotype with every other phenotype, or rather every genetic variant with every
phenotype, in the context of specific genetic variants. We expanded that to
start looking at LD blocks and their co-association patterns across an entire clinical record, such as from the UK
Biobank. We went a step further, from the
region-by-region co-association to a factorization of the genotype-phenotype
correlation matrix, to understand the different factors of these disease associations. We looked at
the epigenomic enrichment analysis that we had seen before, but now in the context of multiple epigenomic annotations
and multiple phenotypic variables simultaneously, learning the co-enrichment patterns of specific
phenotypic classes in similar epigenomic annotations, and then using
that to improve the enrichment and the pathway inferences. And then lastly, how we can model these electronic
health records by specifically looking at meta-phenotypes, these
higher-order terms that can combine information from multiple individual
data modalities such as clinical notes, lab tests, prescriptions, ICD-9 billing codes, and so forth, and how
we can use that to impute missing data and also to prioritize
patients by disease. Who feels that they've learned stuff? Awesome. One more
