Enterprise Master Patient Index Entity Recognition by Long Short-Term Memory Network in Electronic Health Systems

Named-entity recognition (NER) is an information-extraction task in artificial intelligence (AI) that locates conceptual entities in natural language and classifies them into predefined categories. In this study, we apply Long Short-Term Memory (LSTM) networks to identify patient entities in the Enterprise Master Patient Index (EMPI). A sample dataset of 300,000 deidentified patient records is used to test LSTM performance on EMPI entity recognition. The data entries are first converted into strings and represented by a 200-dimensional Word2Vec model. Two LSTM models are developed for the NER problem: the first uses a multiclass classifier with a softmax function, and the second uses a two-step classification procedure with a binary logistic function. For comparison, we use a conventional deep neural network model in which the Levenshtein distance is used to represent the training data patterns. Classification performance is evaluated by ten-fold cross-validation. The two-step LSTM model achieves a classification accuracy of 99.82%, which is superior to both the multiclass LSTM classifier at 61.08% and the conventional deep neural network at 95.08%. We therefore conclude that the new two-step LSTM model provides an accurate and reliable solution for recognizing EMPI patient entities when it is properly configured and trained.
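
As an illustration of the Levenshtein distance used by the baseline model, the following is a minimal Python sketch of the standard dynamic-programming edit distance between two strings. The abstract does not describe how the distance is turned into input features for the deep neural network, so the function name `levenshtein` and the sample record strings are assumptions made purely for illustration and are not taken from the study's dataset or implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# Hypothetical, made-up record strings (not from the study's data).
record_a = "JOHN SMITH 1980-01-02"
record_b = "JON SMITH 1980-01-02"
distance = levenshtein(record_a, record_b)  # one deleted character: 1
```

A small distance between two record strings suggests they may refer to the same patient, which is the kind of pattern a distance-based baseline could learn from.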