Haonan Yao: efficient pipeline for data extraction from clinical notes
Amir: postdoc salary has increased
Postdocs went on strike; it all worked out.
Thanks for sharing
Haonan:
Sample clinical notes, fine-tune a large language model, then, depending on the model's performance, choose samples for additional testing
Train on a minimal number of notes to achieve optimal performance
10,000 notes; ensure the notes cover a wide range of scenarios. Training approaches. Improve the model's ability
Different scenarios and diverse notes, so the model is robust across different clinical situations.
Take clinical notes and find the 10 most useful notes, extract the data we are interested in
Look at which types of notes are hardest for the model; minimize the amount of labeling per task
Instead of 200 or 300 labeled notes, get away with 10, 20 or 30. Scale the number of data extraction tasks that we do.
Select the most diverse notes.
Notes dataset: 10,000 notes from MIMIC-IV, a sample of the original data. Combined with discharge notes via hadm_id
hadm_id code: each note has a unique id, with many ICD codes included.
Want to label as few notes as possible
Ultimately the task is to extract diagnoses from a procedure note or pull symptoms from the HPI; may not be happy with zero-shot performance
Instructions alone are not doing a good enough job; models are slow and expensive, and we want to run on 100,000 notes
Give it 20 or 30 examples; the most efficient examples can bypass those obstacles for the specific tasks. Technical things to work out; general concept: can we choose the most efficient labels to help the model learn from them
Code: K-means + PCA visualization
Samples: clustering algorithm identifies different categories, and each group contributes samples
Text code matrix, each node unique ICD code.
Relevance of specific ICD code in a particular note
Relatively small number of clusters, specifically 5. Principal component analysis (PCA) for visualization purposes
Contribution of clinical notes
Following
Approach lets us visualize and understand the distribution of clinical notes
Sample selection
Enhance the model's generalizability and robustness
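The clustering and visualization steps above could be sketched roughly as follows. This is a minimal sketch, not the code shown in the meeting: it assumes a numeric feature matrix with one row per note, and the random matrix, the per-cluster sample size of 20, and all variable names are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder features: 200 synthetic "notes" with 20 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

# Cluster the notes into a small number of groups (5, as in the notes above).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project to 2 dimensions with PCA, for plotting only.
coords = PCA(n_components=2).fit_transform(X)

# Let each cluster contribute up to 20 notes to the sample (assumed size).
per_cluster = 20
sampled = []
for c in range(5):
    members = np.flatnonzero(labels == c)
    take = min(per_cluster, len(members))
    sampled.extend(rng.choice(members, size=take, replace=False).tolist())
```

The `coords` array can be passed to any scatter plot, colored by `labels`, to look at how the notes are distributed.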
Algorithm to convert ICD 10 codes into a vector
TF-IDF vectorizer converts the text into a vector, for the ICD codes matrix
tfidf_vectorizer turns words into numeric values
500-dimensional vector; turn the data into numbers
Icd codes,
Icd codes called each
Combined matrix with note and ICD code. Directly convert icd codes
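One way the combined matrix could be built is sketched below. It is an assumption-laden illustration: the three notes and their codes are invented toy data, and the variable names and the choice to apply TF-IDF to the codes as whitespace-separated tokens are mine, not necessarily what the presented code does.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for note text and the ICD codes attached to each admission.
notes = [
    "patient admitted with chest pain and shortness of breath",
    "follow up visit for type 2 diabetes, blood sugar well controlled",
    "chest pain resolved, discharged home in stable condition",
]
icd_codes = ["I20.9 R06.02", "E11.9", "I20.9 Z09"]

# Text features: keep at most 500 terms, matching the 500 mentioned above.
text_vectorizer = TfidfVectorizer(max_features=500)
X_text = text_vectorizer.fit_transform(notes)

# Code features: treat each whitespace-separated code as one token.
code_vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
X_codes = code_vectorizer.fit_transform(icd_codes)

# One combined sparse matrix: a row per note, text columns then code columns.
X_combined = hstack([X_text, X_codes]).tocsr()
```

In this setup every code is an unrelated token, so two clinically related codes share nothing unless they happen to co-occur in the same notes.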
Does this method capture the semantic meaning of an ICD-10 code?
Let's say this code is 1st trimester complications
Another code looks different, even nonsensical; unless the embedding system is trained, it may not know that those two are related.
That will mess up how you set this up; it may not know what is similar or dissimilar
Words may work; make sure ICD codes can work
Embedding
Datavec
Doesn't have a perfect mapping, but should work on MIMIC data
Mimic vec: download the embedding dictionary, look up the code in the dictionary, and now you have a 500-dimensional vector.
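The dictionary lookup could look something like the sketch below. The table, its code names, and its 3-dimensional vectors are hypothetical placeholders (a real downloaded table would be loaded from a file and have hundreds of dimensions), and averaging the code vectors per note is just one simple assumption about how to combine them.

```python
from typing import Dict, List, Optional

# Hypothetical embedding dictionary: code -> vector.
EMBEDDINGS: Dict[str, List[float]] = {
    "CODE_A": [1.0, 0.0, 2.0],
    "CODE_B": [3.0, 2.0, 0.0],
}


def note_vector(
    codes: List[str], table: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of the codes found in the table.

    Codes missing from the table are skipped, since the mapping may be
    imperfect. Returns None when no code could be looked up.
    """
    found = [table[c] for c in codes if c in table]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


vec = note_vector(["CODE_A", "CODE_B", "CODE_MISSING"], EMBEDDINGS)
print(vec)  # [2.0, 1.0, 1.0]
```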
https://www.medrxiv.org/content/10.1101/2023.04.24.23289046v3.full.pdf
https://github.com/kaneplusplus/icd-10-cm-embedding
https://arxiv.org/abs/2204.10408
mimic does not have ICD 10 codes, has ICD 9
https://arxiv.org/abs/1906.05492
cui2vec
primary focus on manual annotation of clinical notes
select 100 clinical notes from 5 existing clusters, calculate and select the 10 least similar of these notes for detailed annotation
ensures that we cover a wide range of data critical to testing the robustness of the model
focus on dissimilar notes to enhance model
accuracy and generalizability for effective
cosine similarity notes
100 note diagram, 100 notes very large to describe the capture description
By using this method, identify the 10 dissimilar notes
Cosine similarity method, a data-driven approach
Identify the notes with the greatest difference.
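Picking the 10 least similar notes out of 100 can be done in several ways; the notes do not say which rule was used. The sketch below assumes a greedy farthest-first rule: start from the note with the lowest average similarity to the rest, then repeatedly add the note whose highest similarity to anything already chosen is smallest. The function name and the random vectors are placeholders.

```python
import numpy as np


def select_dissimilar(vectors, k):
    """Greedily pick k row indices that are mutually dissimilar (cosine)."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # unit-length rows
    sim = Xn @ Xn.T  # pairwise cosine similarity

    # Seed with the note that is least similar to the others on average.
    chosen = [int(np.argmin(sim.mean(axis=1)))]
    while len(chosen) < min(k, len(X)):
        # For each note, its highest similarity to any already-chosen note.
        max_sim = sim[:, chosen].max(axis=1)
        max_sim[chosen] = np.inf  # never re-pick a chosen note
        chosen.append(int(np.argmin(max_sim)))
    return chosen


# Placeholder: 100 synthetic note vectors with 50 features each.
rng = np.random.default_rng(0)
note_vectors = rng.normal(size=(100, 50))
picked = select_dissimilar(note_vectors, k=10)
```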
Fine tuning process may involve maximum adjustment, modifying certain layers of the structure of the model to better fit requirement
Fine tuning the model performance
Test data to ensure it achieves the desired level of effectiveness.
Inference: use the trained model to predict on new data
New further change is performed
Evaluate overall change of the model in real-world scenarios. Use the fine-tuned model on the dissimilar notes: which notes have the greatest impact on the fine-tuned model
Can help us to understand the strengths and weaknesses of the model
How to respond to a variety of clinical scenarios
Improving the practicality and reliability of the model
Optimize the
Batch process the data in smaller batches that fit into memory
Otherwise it can cause the kernel to crash; learn this part
Limitations of K-means as a basic framework for clinical work
Effectiveness limited by stability issues and the challenge of creating results in a clinical way
Third step: more advanced method for the complexity of clinical notes
Advanced vectorizers combined with UMAP offer enhanced feature extraction
Effective handling of high-dimensional data, improved visualization and better clustering
Using this method with the 10,000-note data, the visualization cannot be displayed correctly
Inference
Slide 3 Overall picture
User interface, expert does the labeling to help train the model
Batching is a skill, how to set up code so that you do things in groups of 100
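A minimal sketch of doing things in groups of 100 is shown below. The `batched` helper and the `process_batch` stand-in are hypothetical names for illustration; in practice `process_batch` would be whatever expensive step (vectorizing, model inference) needs to fit in memory.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_batch(notes: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for the real per-batch work."""
    return [len(note) for note in notes]


notes = [f"note {i}" for i in range(250)]
results: List[int] = []
for batch in batched(notes, size=100):
    # Only one batch is handled at a time, so peak memory stays bounded.
    results.extend(process_batch(batch))
```

With 250 notes this runs three batches of 100, 100 and 50.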
https://figshare.com/articles/dataset/Pre-trained_cui2vec_embeddings/6082922
https://github.com/kaneplusplus/icd-10-cm-embedding
https://www.scdiscoveries.com/blog/knowledge/what-is-a-umap-plot/#:~:text=UMAP%20is%20an%20algorithm%20that,expression%20counts%20per%20individual%20cell.
How does umap work
https://umap-learn.readthedocs.io/en/latest/plotting.html
embedding of ICD 9
embed the ICD code
cui2vec compared to the one just posted
have better performance
Do better against multiple metrics
Compare
https://pypi.org/project/icdcodex/
vector of list of ICD codes to another list
uses a graph network representation.
Structure of icd coding
List of codes: embedder.to_vec generates an array
Model uses network
node2vec r
ICD
ICD 10 codes how to represent
Series to represent ICD-10 codes: conceptualize it as a graph network, or treat it like a large language model that predicts the next word
Model to predict next sequence
Definition
Pick data support it
Transform
Embedding idea: turn a word or concept into an arrow pointing in some direction in a mathematical space
2 dimensions or 1000s
Assigned a vector that you are interested in
ICD 10 codes
Next job
How you are going to compare
Cosine similarities
Compare differences between vectors
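For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The small sketch below uses only the standard library; the function name and the toy 2-dimensional vectors are chosen for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vector length does not matter, only direction.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])  # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1.0
```

Values near 1 mean the two items point the same way in the embedding space; values near 0 or below mean they are unrelated or opposed.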
Embedding chunk split texts
Models corrected how well they do chunksword embeddings
ICD 10 codes
Whatever unit interested in turn it into a number
Base architecture
Embedding models
Series of numbers, as a graph network how you think of the idea
Different ideas, embedding models
General concept
Turn original data into a vector
Vector analysis with logistic regression
Image vectors, each frame
Text turn
How to represent mathematical representation
Text into mathematical vector
Sentence mathematical representation
A good embedding model will place similar things close to one another
Turn these sentences into vectors
Embedding model
Create some context knowledge from your sentence
Get something similar
A good embedding model
Ask OpenAI: something takes your text and turns it into a mathematical representation
Predict the next word. It is trained on a big dataset of text
Mlm mass, mathematical
Math layer
Fine tuning LLM
Removal the travel and force the lm
Correct it will train it well
Train on billions of texts and come up with an LLM
Embedding
To scan these texts into representations
Mathematics on the model
User interface
ChatGPT Plus
New custom agents in the ChatGPT store
More troubleshooting
Similar format to have
Some of the agents will do that for you
Play around for that
Can make your own: do a basic version, tell it what you want it to do, narrow ChatGPT functionality to make it more useful