Early diagnosis and treatment of colorectal polyps are crucial for preventing colorectal cancer. This paper proposes a lightweight convolutional neural network for the automatic detection and auxiliary diagnosis of colorectal polyps. First, a 53-layer convolutional backbone incorporating a spatial pyramid pooling module extracts features at several receptive-field sizes. Next, a feature pyramid network performs cross-scale fusion of the backbone's feature maps, and a spatial attention module sharpens the perception of polyp boundaries and fine details. Finally, a positional pattern attention module automatically mines and integrates key features across the different feature-map levels, yielding fast, efficient, and accurate automatic detection of colorectal polyps. Evaluated on a clinical dataset, the model achieves an accuracy of 0.9982, recall of 0.9988, F1 score of 0.9984, and mean average precision (mAP) of 0.9953 at an intersection-over-union (IoU) threshold of 0.5, running at 74 frames per second with only 9.08 M parameters. Compared with existing mainstream methods, the proposed method is lightweight, has modest hardware requirements, and combines high detection speed with high accuracy, making it a feasible technique and a valuable tool for the early detection and diagnosis of colorectal cancer.
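As a rough illustration of two of the generic building blocks named above, the PyTorch sketch below implements a standard spatial pyramid pooling block and a CBAM-style spatial attention module. The channel sizes, pooling kernels, and class names are illustrative assumptions, not the authors' implementation, and the paper-specific positional pattern attention module is not reproduced here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel max-pools with different kernel
    sizes give the same feature map several receptive-field scales."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):  # kernels assumed
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes
        )
        # 1x1 conv fuses the original map plus one branch per pool size
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([x] + [p(x) for p in self.pools], dim=1))

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise mean and max maps are
    fused by a conv into a per-location weight, emphasizing boundaries."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

feats = torch.randn(1, 256, 52, 52)            # a backbone feature map
out = SpatialAttention()(SPP(256)(feats))
print(out.shape)                               # torch.Size([1, 256, 52, 52])
```

In a detector of this kind, the SPP output would feed the feature pyramid network, with spatial attention applied to the fused pyramid levels before the detection head.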
Medical visual question answering (MVQA) plays a crucial role in computer-aided diagnosis and telemedicine. Because MVQA datasets are small and unevenly annotated, most existing methods rely on additional datasets for pre-training and adopt a discriminative formulation that predicts answers from a predefined label set, which makes the model prone to overfitting in low-resource domains. To address these problems, we propose an image-aware generative MVQA method based on image caption prompts. First, we combine a dual visual feature extractor with a progressive bilinear attention interaction module to extract multi-level image features. Second, we propose an image caption prompting method that guides the model to a better understanding of the image content. Finally, an image-aware generative model produces the answers. Experimental results show that the proposed method outperforms existing models on the MVQA task, achieving efficient visual feature extraction and flexible, accurate answer generation at small computational cost in low-resource domains. This is of great significance for achieving personalized precision medicine, reducing the medical burden, and improving the efficiency of medical diagnosis.
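To make the interaction and prompting steps concrete, the sketch below shows one bilinear attention step between image-region and question-token features, followed by the kind of caption-augmented prompt a generative answerer could consume. All dimensions, names, and example strings are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """One bilinear attention interaction step between image-region and
    question-token features; stacking such steps is one way to realize a
    progressive interaction. A sketch, not the authors' code."""
    def __init__(self, v_dim, q_dim, h_dim=512):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, h_dim)  # project image features
        self.q_proj = nn.Linear(q_dim, h_dim)  # project question features

    def forward(self, v, q):
        V = self.v_proj(v)                     # (B, Nv, h) image regions
        Q = self.q_proj(q)                     # (B, Nq, h) question tokens
        attn = (V @ Q.transpose(1, 2)).softmax(dim=1)  # weights over regions
        return attn.transpose(1, 2) @ V        # (B, Nq, h) fused features

fused = BilinearAttention(2048, 768)(torch.randn(2, 36, 2048),
                                     torch.randn(2, 12, 768))
print(fused.shape)                             # torch.Size([2, 12, 512])

# Image caption prompt: a generated caption is prepended to the question so
# the generative answerer conditions on an explicit textual view of the image.
caption = "Chest X-ray with clear lung fields."   # hypothetical caption
question = "Is there evidence of pneumonia?"      # hypothetical question
prompt = f"Context: {caption} Question: {question} Answer:"
```

Generating free-form answers from such a prompt, rather than classifying over a fixed label set, is what allows flexible outputs in low-resource settings.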