Reverse engineering of gene regulatory networks (GRNs) has long been an attractive research topic in systems biology. Computational prediction of gene regulatory interactions remains challenging because gene expression is complex and information resources are scarce. High-throughput spatial gene expression data, such as in situ hybridization images that capture temporal and spatial expression patterns, provide abundant and reliable information for GRN inference. However, computational tools for analyzing spatial gene expression data are highly underdeveloped. In this study, we develop ConGRI, a new method for identifying gene regulatory interactions from gene expression images. The method combines a contrastive learning scheme with a deep Siamese convolutional neural network architecture: it automatically learns high-level feature embeddings for the expression images and then feeds the embeddings to an artificial neural network that determines whether an interaction exists. We apply the method to a Drosophila embryogenesis dataset and identify GRNs of eye development and mesoderm development. Experimental results show that ConGRI outperforms previous traditional and deep learning methods by a large margin, achieving accuracies of 76.7% and 68.7% for the GRNs of early eye development and mesoderm development, respectively. It also reveals some master regulators of Drosophila eye development. ConGRI is available at https://github.com/lugimzheng/ConGRI. Supplementary data are available at Bioinformatics online.
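
To make the contrastive learning idea concrete, the following minimal sketch computes a classic margin-based contrastive loss for one pair of embeddings, as a Siamese network's two branches might produce. It is illustrative only: the function names, the Euclidean distance, and the margin value are assumptions, and the loss actually used in ConGRI may differ.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Margin-based contrastive loss for a single pair (hypothetical helper).

    emb_a, emb_b: embeddings from the two branches of a Siamese network.
    same: True if the pair is labelled as similar, False otherwise.
    margin: assumed hyperparameter; dissimilar pairs farther apart than
            this contribute no loss.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same:
        # Similar pairs are pulled together: penalize any distance.
        return d ** 2
    # Dissimilar pairs are pushed apart until they exceed the margin.
    return max(0.0, margin - d) ** 2


if __name__ == "__main__":
    a, b = [0.0, 0.0], [0.6, 0.0]
    print(contrastive_loss(a, b, same=True))
    print(contrastive_loss(a, b, same=False))
```

Minimizing this quantity over many labelled pairs encourages embeddings of similar inputs to cluster and embeddings of dissimilar inputs to separate, which is the property a downstream classifier can then exploit.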