Advances in deep learning have shown promising potential for scalable video analytics in the cloud. In constrained settings, however, it is not feasible to send every video frame from cameras at the edge. Being selective about which frames to send to the cloud reduces bandwidth consumption, conserves energy, and protects user privacy. Existing edge-based filtering solutions primarily focus on object detection and train binary classifiers to suppress transmission of irrelevant frames. This approach is hard to scale when the queries of interest change rapidly, and it is infeasible for complex queries specified in natural language. In this paper, we propose mmFilter, a multimodal approach for filtering video streams at the edge to match predefined events defined via natural-language queries. mmFilter learns compact representations for video and text and automatically matches semantically similar pairs in a joint embedding space. Our model generalizes to a variety of unseen queries, so new, complex queries can be added to the system in real time without retraining. We have implemented and evaluated our system on popular video datasets such as ActivityNet Captions and MSRVTT. mmFilter achieves a 1.5× improvement in event detection accuracy compared with the state-of-the-art filtering approach.
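
To make the matching step concrete, the sketch below illustrates joint-embedding filtering with cosine similarity. This is a minimal illustration under our own assumptions, not the mmFilter implementation: the `should_transmit` helper, the similarity threshold, and the toy vectors are all hypothetical stand-ins for the learned video and text encoders described above.

```python
import numpy as np

def normalize(v):
    """L2-normalize a vector so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v)

def should_transmit(clip_emb, query_emb, threshold=0.5):
    """Transmit a clip only if it is semantically close to the query.

    Both arguments are embeddings in the shared video-text space;
    the threshold is a hypothetical tuning parameter, not a value
    from the paper.
    """
    return float(np.dot(normalize(clip_emb), normalize(query_emb))) >= threshold

# Toy 3-d embeddings standing in for encoder outputs.
clip = np.array([0.9, 0.1, 0.0])
query_match = np.array([1.0, 0.2, 0.1])   # points in a similar direction
query_other = np.array([0.0, 0.1, 1.0])   # nearly orthogonal

print(should_transmit(clip, query_match))  # True: similar direction
print(should_transmit(clip, query_other))  # False: dissimilar
```

Because the decision is a similarity test in a shared space rather than a per-query classifier, a new natural-language query only requires embedding the query text; no retraining is needed at the edge.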