Lab Meeting 1/24/2028

Haonan Yao: efficient pipeline for data extraction from clinical notes


Amir: postdoc salary has increased.


Postdocs went on strike; it all worked out.


Thanks for sharing


Haonan: 



 


 



Sample clinical notes, fine-tune a large language model, then choose samples for additional testing depending on the model's performance


Train on a minimal number of notes to achieve optimal performance




 


10,000 notes; ensure the notes cover a wide range of scenarios. Training approaches. Improve ability

Diverse notes from different scenarios make the model robust to different clinical situations.


Take clinical notes and find the 10 most useful notes, extract data we are interested in

Look at which types of notes are hardest for the model; minimize the amount of labeling per task


Instead of 200 or 300 labeled notes, get away with 10, 20 or 30 notes. Scale the number of data extraction tasks that we do.


Select the most diverse notes.


Dataset: 10,000 notes from MIMIC-IV, a sample of the original data, combined with discharge notes via hadm_id


hadm_id: each note has a unique ID, and many ICD codes are included.


Want to label as few notes as possible

Ultimately the task is to extract diagnoses from a procedure note or pull symptoms from the HPI; we may not be happy with zero-shot performance

Instructions alone do not do a good enough job; models are slow and expensive, and we want to run on 100,000 notes


Give it 20 or 30 examples; the most efficient examples can bypass those obstacles for the specific tasks. Technical things to work out; the general concept is whether we can do the most efficient labeling to help the model learn from it



Code K-means + PCA visualization


Clustering algorithm identifies different categories of samples; each group contributes samples


Text-code matrix; each note has its unique ICD codes.


Relevance of specific ICD code in a particular note 



Relatively small number of clusters, specifically 5. Principal component analysis (PCA) for visualization purposes
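
A minimal sketch of the K-means + PCA step described above, assuming scikit-learn. The feature matrix here is synthetic (`make_blobs`) and only stands in for the real note vectors; the 50 features and 200 rows are arbitrary choices for illustration, not values from the talk.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the note feature matrix (one row per note).
X, _ = make_blobs(n_samples=200, n_features=50, centers=5, random_state=0)

# Cluster into 5 groups, the number mentioned in the notes.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project to 2 dimensions for plotting only; clustering used all 50 features.
coords = PCA(n_components=2, random_state=0).fit_transform(X)

# Number of notes that landed in each cluster.
cluster_sizes = np.bincount(labels, minlength=5)
```

`coords` can be passed to any scatter plot, colored by `labels`, to see how the clusters spread out.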

Contribution of clinical notes

Following 


Approach respectively visualize and understand distribution of clinical notes


Sample selection

Enhance the model's generalizability and robustness


Algorithm to convert ICD 10 codes into a vector


TF-IDF vectorizer converts the text into vectors for the note and ICD code matrix


tfidf_vectorizer turns words into numeric values


500-dimensional vector for each segment of words; turns the data into numbers
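
A small sketch of the TF-IDF step, assuming scikit-learn's `TfidfVectorizer`. The three notes are made-up strings, and reading "500" as `max_features=500` is an assumption, not something confirmed in the talk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example notes; the real input would be the MIMIC note text.
notes = [
    "patient admitted with chest pain and shortness of breath",
    "follow up visit for type 2 diabetes, no chest pain",
    "first trimester bleeding, ultrasound performed",
]

# max_features=500 caps the vocabulary, so each note becomes a vector
# of at most 500 numbers (assumed meaning of "500" in the notes).
tfidf_vectorizer = TfidfVectorizer(max_features=500)
text_matrix = tfidf_vectorizer.fit_transform(notes)  # sparse, shape (n_notes, n_terms)
```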


Icd codes, 



Icd codes called each 


Combined matrix of note text and ICD codes. Directly convert ICD codes
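
One possible way to build such a combined matrix, sketched with scikit-learn and SciPy. The notes and code lists are invented for illustration, and vectorizing the codes with a second TF-IDF pass is an assumption about what "directly convert" means.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up notes and code lists, aligned row by row (one admission each).
notes = [
    "patient admitted with chest pain and shortness of breath",
    "follow up visit for type 2 diabetes",
    "first trimester bleeding, ultrasound performed",
]
icd_codes = ["I20.9 R06.02", "E11.9", "O20.9"]

text_matrix = TfidfVectorizer(max_features=500).fit_transform(notes)

# Treat every whitespace-separated code as one token; the default
# tokenizer would split a code at the "." and lowercase it.
code_vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
code_matrix = code_vectorizer.fit_transform(icd_codes)

# Side-by-side: text columns first, then one column per distinct code.
combined = hstack([text_matrix, code_matrix]).tocsr()
```

In this setup each code is just an independent column, so two clinically related codes look no more alike than two unrelated ones.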

Does this method capture the semantic meaning of an ICD-10 code?


Say one code is first trimester complications.

Another code looks different, even nonsensical next to it; you may not know the two are related unless your embedding system was trained to know that.

That will mess up how you set this up; you may not know what is similar or dissimilar.

Words may work; make sure ICD codes work too.


Embedding


Datavec

Doesn’t have a perfect mapping; should work on MIMIC data


MIMIC vec: download the embedding dictionary, look up the code in the dictionary, and now you have a 500-dimensional vector.
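
A sketch of the dictionary lookup idea. The dictionary below is filled with random vectors and is purely hypothetical; a real one would be loaded from the downloaded embedding file. Averaging the code vectors to get one vector per note is also an assumption, not something stated in the meeting.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 500

# Hypothetical stand-in for a downloaded embedding dictionary (code -> vector).
embedding_dict = {
    code: rng.normal(size=EMBED_DIM) for code in ["4019", "25000", "41401"]
}

def embed_codes(codes, table, dim=EMBED_DIM):
    """Average the vectors of the codes found in the table; zeros if none are found."""
    found = [table[c] for c in codes if c in table]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# "V5861" is not in the toy dictionary, so it is skipped (imperfect mapping).
note_vector = embed_codes(["4019", "25000", "V5861"], embedding_dict)
```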



https://www.medrxiv.org/content/10.1101/2023.04.24.23289046v3.full.pdf


https://github.com/kaneplusplus/icd-10-cm-embedding


https://arxiv.org/abs/2204.10408


mimic does not have ICD 10 codes, has ICD 9




https://arxiv.org/abs/1906.05492


cui2vec



primary focus on manual annotation of clinical notes

select 100 clinical notes from 5 existing clusters, calculate and select the 10 least similar of these notes for detailed annotation


ensures that we cover a wide range of data

that are critical to testing the robustness of the model


focus on dissimilar notes to enhance model accuracy and generalizability


cosine similarity notes



100 note diagram, 100 notes very large to describe the capture description


By using this method, identify the 10 most dissimilar notes

Cosine similarity method, data driven approach



Identify notes with the greatest differences.
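
One way the "10 least similar out of 100" selection could be done, sketched with NumPy. The greedy rule (start from the least typical note, then keep adding the note least similar to anything already chosen) is an assumption; the talk did not specify the exact procedure. The 100 x 50 matrix is random stand-in data.

```python
import numpy as np

def select_dissimilar(X, k=10):
    """Greedily pick k row indices of X that are mutually dissimilar by cosine similarity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T  # pairwise cosine similarity, shape (n, n)

    # Start from the note with the lowest average similarity to all notes.
    selected = [int(np.argmin(sim.mean(axis=1)))]
    while len(selected) < k:
        # For each candidate, its highest similarity to anything already chosen.
        closest = sim[:, selected].max(axis=1)
        closest[selected] = np.inf  # never re-pick a chosen note
        selected.append(int(np.argmin(closest)))
    return selected

rng = np.random.default_rng(0)
candidates = rng.normal(size=(100, 50))  # stand-in for 100 note vectors
chosen = select_dissimilar(candidates, k=10)
```

`chosen` holds the row indices of the notes to send for detailed annotation.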



Fine tuning process may involve maximum adjustment, modifying certain layers of the structure of the model to better fit requirement


Fine tuning the model performance

Test data to ensure it achieves the desired level of effectiveness.


Inference use trained model to predict new data


New further change is performed


Evaluate the overall change of the model in real-world scenarios. Use the fine-tuned model on the dissimilar notes; see which notes have the greatest impact on the fine-tuned model


Can help us to understand the strength and weakness of the model


How to respond to a variety of clinical scenarios


Improving practicality and reliability of the model


Optimize the 



 


Batch process the data in smaller batches that fit into memory

Otherwise it causes the kernel to crash; learn this part


Limitations of K-means: a basic framework for clinical work

Effectiveness limited by 


Stability issues and challenges in creating clinically meaningful results

Third step: more advanced methods for the complexity of clinical notes


Advanced vectorizers combined with UMAP offer enhanced feature extraction


Effective handling of high-dimensional data, improved visualization and better clustering


Using this method with 10,000 notes, the visualization cannot be displayed correctly



 


Inference



Slide 3 Overall picture

User interface, expert does the labeling to help train the model 




Batching is a skill, how to set up code so that you do things in groups of 100 
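
A minimal sketch of processing in groups of 100, in plain Python. The helper name `batched` and the list of 1,050 IDs are illustrative only.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

note_ids = list(range(1050))  # stand-in for a list of notes

# Only one batch is held and processed at a time.
batch_lengths = [len(batch) for batch in batched(note_ids)]
```

Here the last batch is shorter than the rest, which the processing step has to tolerate.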


https://figshare.com/articles/dataset/Pre-trained_cui2vec_embeddings/6082922





https://www.scdiscoveries.com/blog/knowledge/what-is-a-umap-plot/


How does umap work


https://umap-learn.readthedocs.io/en/latest/plotting.html



embedding of ICD 9

embed the ICD code

cui2vec compared to the one just posted

have better performance






 


Do better against multiple metrics


Compare 


https://pypi.org/project/icdcodex/


vector of list of ICD codes to another list


uses a graph network representation.


Structure of icd coding


List of codes embedder.to_vec generate an array




 



Model uses network


 



node2vec





ICD 


ICD 10 codes how to represent


Series to represent an ICD-10 code; conceptualize it as a graph network, or treat it like a large language model that predicts the next word


Model to predict next sequence


Definition 


Pick data support it



Transform

Embedding idea: turn a word or concept into an arrow pointing in some direction in a mathematical space

2 dimensions or 1000s


Assigned a vector that you are interested in

ICD 10 codes

Next job

How you are going to compare


Cosine similarities


Compare differences between vectors
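
For reference, the standard definition of cosine similarity between two vectors, with a tiny worked example (the vectors are chosen only for illustration):

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
\]
% Example: a = (1, 0), b = (1, 1)
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{1 \cdot 1 + 0 \cdot 1}{1 \cdot \sqrt{2}}
  = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

A value of 1 means the vectors point the same way, 0 means they are orthogonal, and -1 means they point in opposite directions.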




Embedding chunk split texts


Models corrected how well they do chunksword embeddings

ICD 10 codes


Whatever unit interested in turn it into a number

Base architecture


Embedding models

Series of numbers, as a graph network how you think of the idea

Different ideas, embedding models


General concept

Turn original data into a vector

Vector analysis with logistic regression
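
A small sketch of fitting logistic regression on vectors, assuming scikit-learn. The data comes from `make_classification` and only stands in for embedding vectors with a binary label; nothing here reflects real results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 vectors of length 20 with a binary label each.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One weight per vector dimension is learned.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of held-out vectors classified correctly
```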



Image vectors, each frame


Text turn 

How to represent mathematical representation 


Text into mathematical vector


Sentence mathematical representation




 


A good embedding model will place similar things close to one another

Turn these sentences into vectors


Embedding model

Create some context knowledge from your sentence


Get something similar

A good embedding model

Ask OpenAI: something takes your text and turns it into a mathematical representation


Predict the next word. It is trained on a big dataset of text

MLM (masked language modeling), mathematical


Math layer


Fine tuning LLM


Removal the travel and force the lm

Correct it will train it well



Billions of text and come up with llm


Embedding

To scan these texts into representations


Mathematics on the model


User interface

 



ChatGPT Plus

New custom agents in the GPT store



More troubleshooting



Similar format to have

Some of the agents will do that for you


Play around for that


Can make your own: do a basic version, tell it what you want it to do, and narrow ChatGPT functionality to make it more useful


Comments