In this study, the authors propose a method that combines CNN and LSTM networks to recognize facial expressions. To handle illumination changes and preserve edge information in the image, the method applies two different preprocessing techniques. The preprocessed images are then fed into two independent CNN branches for feature extraction, and the extracted features are fused by an LSTM layer that captures the temporal dynamics of facial expressions. To evaluate the method's performance, the authors use the FER2013 dataset, which contains over 35,000 facial images labeled with seven different expressions; a mixing matrix is generated to ensure a balanced distribution of expressions across the training and testing sets. The proposed model achieves an accuracy of 73.72% on FER2013. The use of focal loss, a variant of cross-entropy loss, further improves performance, particularly in handling class imbalance. Overall, the proposed method demonstrates strong generalization ability and robustness to variations in illumination and facial expression, and has potential applications such as emotion recognition in virtual assistants, driver monitoring systems, and mental health diagnosis.
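
To make the pipeline concrete, the sketch below shows one plausible way to wire two CNN branches into an LSTM and train with focal loss in PyTorch. It is a minimal illustration under assumed details: the branch architecture, feature sizes, the gamma value, and the use of a short frame sequence as input are placeholders, not the configuration reported by the authors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma, which
    down-weights easy examples and helps with class imbalance.
    gamma=2.0 is an assumed, commonly used value."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")
        pt = torch.exp(-ce)  # probability assigned to the true class
        return ((1.0 - pt) ** self.gamma * ce).mean()

class CnnLstmFER(nn.Module):
    """Hypothetical two-branch CNN + LSTM classifier; layer sizes are
    illustrative, not those of the paper."""
    def __init__(self, num_classes=7, feat_dim=128):
        super().__init__()
        # Two independent CNN branches, one per preprocessed view of the face.
        self.branch_a = self._make_branch(feat_dim)
        self.branch_b = self._make_branch(feat_dim)
        # The LSTM fuses the concatenated branch features over time.
        self.lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=128,
                            batch_first=True)
        self.classifier = nn.Linear(128, num_classes)

    @staticmethod
    def _make_branch(feat_dim):
        return nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, x_a, x_b):
        # x_a, x_b: (batch, time, 1, H, W) -- the two preprocessed streams.
        b, t = x_a.shape[:2]
        fa = self.branch_a(x_a.flatten(0, 1)).view(b, t, -1)
        fb = self.branch_b(x_b.flatten(0, 1)).view(b, t, -1)
        seq = torch.cat([fa, fb], dim=-1)   # fuse branch features per step
        out, _ = self.lstm(seq)
        return self.classifier(out[:, -1])  # classify from the last time step

# Usage sketch: batch of 4, sequences of 5 preprocessed 48x48 grayscale crops.
model = CnnLstmFER()
criterion = FocalLoss(gamma=2.0)
xa = torch.randn(4, 5, 1, 48, 48)
xb = torch.randn(4, 5, 1, 48, 48)
labels = torch.randint(0, 7, (4,))
loss = criterion(model(xa, xb), labels)
loss.backward()
```

In this sketch the fusion is a simple per-step concatenation of the two branches' features before the LSTM; the authors' exact fusion strategy and preprocessing operators are not specified here and may differ.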