Football video content analysis is a rapidly evolving field aiming to enrich the viewing experience of football matches. Current research often focuses on specific tasks such as player and ball detection, tracking, and localisation in top-down views. Our study integrates these efforts into a comprehensive Multi-Object Tracking (MOT) model capable of handling perspective transformations. Our framework, FootyVision, employs a YOLOv7 backbone trained on an extended player and ball dataset. The MOT module builds a gallery and assigns identities via the Hungarian algorithm based on feature embeddings, bounding box intersection over union, distance, and velocity. A novel component of our model is the perspective transformation module, which leverages activation maps from the YOLOv7 backbone to compute homographies using lines, intersection points, and ellipses. This method adapts effectively to dynamic and uncalibrated video data, even in viewpoints with limited visual information. FootyVision sets new performance benchmarks: in object detection, it achieves a mean average precision (mAP) of 95.7% and an F1-score of 95.5%. For MOT, it demonstrates robust capabilities, with an IDF1 score of approximately 93% on both the ISSIA and SoccerNet datasets; it reaches a MOTA of 94.04% on SoccerNet and remains competitive on ISSIA. Additionally, FootyVision scores a HOTA(0) of 93.1% and an overall HOTA of 72.16% on SoccerNet. Our ablation study confirms the effectiveness of the selected tracking features and identifies key attributes for further improvement. While the model excels in maintaining track accuracy throughout the testing dataset, we recognise the potential to enhance spatial-location accuracy.
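
The identity-assignment step described above can be sketched as a Hungarian-algorithm matching over a combined cost matrix. The weighting scheme, helper names, and normalisation constants below are illustrative assumptions, not FootyVision's actual parameters; the sketch only shows how embedding distance, box IoU, positional distance, and velocity can be blended into one assignment cost.

```python
# Hypothetical sketch of Hungarian-algorithm identity assignment combining
# appearance embeddings, bounding-box IoU, centre distance, and velocity.
# Weights and scale factors are assumed values for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign(tracks, detections, w=(0.4, 0.3, 0.2, 0.1)):
    """Match tracks to detections. Each item is a dict with keys
    'emb' (L2-normalised feature), 'box', 'pos', and 'vel'."""
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            app = 1.0 - float(t["emb"] @ d["emb"])        # cosine distance
            box = 1.0 - iou(t["box"], d["box"])            # 1 - IoU overlap
            dist = np.linalg.norm(t["pos"] - d["pos"]) / 100.0  # scaled centre distance
            vel = np.linalg.norm(t["vel"] - d["vel"]) / 10.0    # velocity consistency
            cost[i, j] = w[0]*app + w[1]*box + w[2]*dist + w[3]*vel
    rows, cols = linear_sum_assignment(cost)  # minimise total assignment cost
    return list(zip(rows, cols))
```

In this sketch, `linear_sum_assignment` returns the globally optimal one-to-one matching; in practice a tracker would also gate matches above a cost threshold and spawn new identities for unmatched detections.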