Lab Meeting 1/24/2028

Haonan Yao: efficient pipeline for data extraction from clinical notes

Amir: postdoc salary has increased

Postdocs went on strike; it all worked out.

Thanks for sharing

Haonan:

Sample clinical notes, fine-tune a large language model, then, depending on the model's performance, choose samples for additional testing

Train on a minimal number of notes to achieve optimal performance

10,000 notes; ensure the notes cover a wide range of scenarios. Training approaches. Improve ability

Different scenarios and diverse notes make the model robust enough to deal with different clinical situations.

Take clinical notes and find the 10 most useful notes, extract data we are interested in

Look at which types of notes are hardest for the model; minimize the amount of labeling per task

Rather than needing 200 or 300 labeled notes, get away with 10, 20 or 30 notes. Scale the number of data extraction tasks that we do.

Select the most diverse notes.

Dataset: 10,000 notes from MIMIC-IV, a sample of the original data, combined with discharge notes via hadm_id

hadm_id: each note has a unique ID and comes with lots of ICD codes.

Want to label as few notes as possible

Ultimately the task is to extract diagnoses from a procedure note or pull symptoms from the HPI; we may not be happy with zero-shot performance

Instructions alone do not do a good enough job, and the models are slow and expensive when we want to run on 100,000 notes

Give it 20 or 30 examples; the most efficient examples can bypass those obstacles for the specific tasks. There are technical things to work out, but the general concept is: can we pick the most efficient labels to help the model learn from them?

Code: K-means + PCA visualization

A clustering algorithm identifies different categories of notes; each group contributes samples

Text code matrix, each node unique ICD code.

Relevance of specific ICD code in a particular note

Relatively small number of clusters, specifically 5. Principal component analysis (PCA) for visualization purposes
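A minimal sketch of the K-means + PCA step, assuming scikit-learn is installed. The toy notes are made up (not MIMIC data), and `max_features=500`, `n_init=10` and `random_state=0` are illustrative choices, not settings taken from the presentation.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for clinical notes (hypothetical text, not MIMIC data).
notes = [
    "chest pain shortness of breath troponin elevated",
    "chest pain ecg changes cardiac catheterization",
    "fever cough pneumonia chest xray infiltrate",
    "cough fever sputum antibiotics pneumonia",
    "hip fracture fall surgery orthopedic repair",
    "fall wrist fracture cast orthopedic follow up",
    "first trimester bleeding ultrasound pregnancy",
    "pregnancy nausea first trimester prenatal visit",
    "diabetes insulin glucose hyperglycemia",
    "glucose control diabetes metformin insulin",
]

# Turn each note into a TF-IDF vector (feature cap is an assumption).
vectorizer = TfidfVectorizer(max_features=500)
X = vectorizer.fit_transform(notes)

# Cluster into 5 groups; random_state is fixed so reruns give the same labels.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Project to 2 dimensions purely for plotting; PCA needs a dense array.
coords = PCA(n_components=2).fit_transform(X.toarray())

for label, (x, y), text in zip(labels, coords, notes):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {text[:40]}")
```

The 2-D coordinates can be passed to any scatter plot, colored by cluster label. PCA is used only for display here; the clustering itself runs on the full TF-IDF vectors.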

Contribution of clinical notes

The following approach lets us visualize and understand the distribution of clinical notes

Sample selection

Enhance the model's generalizability and robustness

Algorithm to convert ICD 10 codes into a vector

TF-IDF vectorizer converts the text into a vector for the ICD code matrix

Tfidf_vectorizer turns words into numeric values

500-dimensional vector for each segment of text; turn the data into numbers

Icd codes,

Icd codes called each

Combined matrix of note text and ICD codes. The ICD codes are converted directly

Does this method capture the semantic meaning of an ICD-10 code?

Let's say this code is 1st trimester complications

Another code looks different, even nonsensical; you may not know the two are related unless your embedding system was trained to know it.

That will mess up how you set this up; it may not know what is similar or dissimilar

Words may work; make sure ICD codes work too

Embedding

Datavec

Doesn't have a perfect mapping, but should work on MIMIC data

Mimic vec: download the embedding dictionary, look up the code in the dictionary and now you have a 500-dimensional vector.
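The dictionary lookup can be sketched as below. The codes, the 3-number vectors, and the helper names `lookup` and `mean_vector` are all made up for illustration; a real pretrained file would supply hundreds of dimensions per code. Averaging the code vectors into one vector per note is an assumption here, not something stated in the meeting.

```python
# Hypothetical embedding dictionary: ICD code -> vector (values are made up).
embeddings = {
    "4280": [0.12, -0.40, 0.88],
    "42731": [0.10, -0.35, 0.90],
    "25000": [-0.70, 0.22, 0.05],
}


def lookup(codes, table):
    """Return vectors for known codes plus the list of codes with no entry."""
    vectors, missing = [], []
    for code in codes:
        if code in table:
            vectors.append(table[code])
        else:
            missing.append(code)  # imperfect mapping: keep track of gaps
    return vectors, missing


def mean_vector(vectors):
    """Average the code vectors column by column into one note-level vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Codes attached to one admission; the last one is not in the dictionary.
vectors, missing = lookup(["4280", "25000", "V5861"], embeddings)
note_vector = mean_vector(vectors)

print("missing codes:", missing)
print("note vector:", note_vector)
```

Reporting the missing codes matters because the mapping is not perfect; a high miss rate would be a sign that the embedding does not fit the dataset's code version.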

https://www.medrxiv.org/content/10.1101/2023.04.24.23289046v3.full.pdf

https://github.com/kaneplusplus/icd-10-cm-embedding

https://arxiv.org/abs/2204.10408

MIMIC-III has only ICD-9 codes; MIMIC-IV has both ICD-9 and ICD-10

https://arxiv.org/abs/1906.05492

cui2vec

primary focus on manual annotation of clinical notes

select 100 clinical notes from 5 existing clusters, calculate and select the 10 least similar of these notes for detailed annotation

this ensures that we cover a wide range of data that is critical to testing the robustness of the model

focus on dissimilar notes to enhance model accuracy and generalizability

cosine similarity between notes

100 note diagram, 100 notes very large to describe the capture description

By using this method, identify the 10 most dissimilar notes

Cosine similarity method, data-driven approach

Identify the notes with the greatest differences.
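One way to pick the least similar notes is a greedy selection over the cosine similarity matrix, sketched below in plain Python. The greedy rule (start from the note with the lowest average similarity, then keep adding the note least similar to anything already chosen) is an assumption for illustration; the meeting did not specify the exact selection rule. The function name `pick_dissimilar` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_dissimilar(vectors, k):
    """Greedily choose k indices whose vectors are mutually least similar."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Start with the note that is least similar to the collection on average.
    chosen = [min(range(n), key=lambda i: sum(sim[i]) / n)]
    while len(chosen) < min(k, n):
        remaining = [i for i in range(n) if i not in chosen]
        # Add the note whose closest already-chosen neighbor is farthest away.
        chosen.append(min(remaining, key=lambda i: max(sim[i][j] for j in chosen)))
    return chosen


# Toy note vectors: two near-duplicate pairs and one note on its own.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
picked = pick_dissimilar(vectors, k=3)
print("indices chosen for annotation:", picked)
```

With 100 candidate notes and k=10 the same function applies unchanged; the similarity matrix is only 100 by 100, so memory is not a concern at that size.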

The fine-tuning process may involve adjusting or modifying certain layers of the model's structure to better fit the requirements

Fine tuning the model performance

Test data to ensure it achieves the desired level of effectiveness.

Inference: use the trained model to predict on new data

New further change is performed

Evaluate the overall change in the model in real-world scenarios. Use the fine-tuned model on the dissimilar notes to see which notes have the greatest impact on fine-tuning

Can help us to understand the strengths and weaknesses of the model

How to respond to a variety of clinical scenarios

Improving the practicality and reliability of the model

Optimize the

Process the data in smaller batches that fit into memory

Otherwise the kernel crashes; need to learn this part

Limitations of K-means as a basic framework for clinical work

Effectiveness limited by stability issues and the challenge of producing clinically meaningful results

Third step – more advanced methods for the complexity of clinical notes

Advanced vectorizers combined with UMAP offer enhanced feature extraction

Effective handling of high-dimensional data, improved visualization and better clustering

Using this method with all 10,000 notes, the visualization cannot be displayed correctly

Inference

Slide 3 Overall picture

User interface, expert does the labeling to help train the model

Batching is a skill: how to set up code so that you do things in groups of 100
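A minimal sketch of the batching pattern in plain Python. `process_batch` is a hypothetical placeholder (it just counts words); in practice it would be the vectorizer, embedding call, or model inference that does not fit in memory when run on everything at once.

```python
def batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(batch):
    """Placeholder for the expensive step; here it only counts words."""
    return [len(note.split()) for note in batch]


# 250 toy notes, processed 100 at a time so only one batch is in flight.
notes = [f"note number {i}" for i in range(250)]
results = []
batch_sizes = []
for batch in batches(notes, batch_size=100):
    batch_sizes.append(len(batch))
    results.extend(process_batch(batch))

print("batch sizes:", batch_sizes)
print("results collected:", len(results))
```

The last batch is simply shorter than the others, so the total does not need to be a multiple of the batch size.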

https://figshare.com/articles/dataset/Pre-trained_cui2vec_embeddings/6082922

https://github.com/kaneplusplus/icd-10-cm-embedding

https://www.scdiscoveries.com/blog/knowledge/what-is-a-umap-plot/#:~:text=UMAP%20is%20an%20algorithm%20that,expression%20counts%20per%20individual%20cell.

How does UMAP work?

https://umap-learn.readthedocs.io/en/latest/plotting.html

embedding of ICD 9

embed the ICD code

cui2vec compared to the one just posted

have better performance

Do better against multiple metrics

Compare

https://pypi.org/project/icdcodex/

Maps a list of ICD codes to a list of vectors

uses a graph network representation.

Structure of ICD coding

Pass a list of codes to embedder.to_vec to generate an array

Model uses network

node2vec

ICD

How to represent ICD-10 codes

Use a series of numbers to represent an ICD-10 code: conceptualize it as a graph network, or treat it like a large language model that predicts the next word

Model to predict next sequence

Definition

Pick data support it

Transform

Embedding idea: turn a word or concept into an arrow pointing in some direction in a mathematical space

2 dimensions or 1000s

Assign a vector to whatever you are interested in

ICD 10 codes

Next job

How you are going to compare

Cosine similarities

Compare differences between vectors
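Cosine similarity compares the direction of two vectors: the dot product divided by the product of their lengths, so 1 means the same direction and 0 means unrelated (orthogonal). A small sketch with made-up vectors:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)


# Made-up vectors standing in for embedded codes or notes.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, twice as long
c = [-3.0, 0.0, 1.0]   # orthogonal to a (dot product is zero)

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Because the measure ignores vector length, a long note and a short note about the same topic can still score as similar.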

Embedding: split texts into chunks

Models are compared on how well they do chunk and word embeddings

ICD 10 codes

Whatever unit you are interested in, turn it into numbers

Base architecture

Embedding models

Series of numbers, as a graph network how you think of the idea

Different ideas, embedding models

General concept

Turn original data into a vector

Vector analysis with logistic regression

Image vectors, each frame

Text turn

How to represent mathematical representation

Text into mathematical vector

Sentence mathematical representation

A good embedding model will place similar things close to one another

Turn these sentences into vectors

Embedding model

Create some context knowledge from your sentence

Get something similar

A good embedding model

Ask OpenAI something and it turns your text into a mathematical representation

Predict the next word. It is trained on a big dataset of text

Mlm mass, mathematical

Math layer

Fine tuning LLM

Remove a word and force the LM to predict it

Correcting it trains it well

Train on billions of texts and you come up with an LLM

Embedding

To scan these texts into representations

Mathematics on the model

User interface

ChatGPT Plus

New custom agents in the ChatGPT store

More troubleshooting

Similar format to have

Some of the agents will do that for you

Play around for that

Can make your own: do a basic version, tell it what you want it to do, and narrow ChatGPT's functionality to make it more useful

SamuelYHuang
