Scalable Big Data Analytics and Fare Prediction for NYC Taxi Trips Using Distributed Computing and Machine Learning
DOI:
https://doi.org/10.62951/ijies.v3i1.97
Keywords:
Big Data, Spark, Databricks, NYC Taxi, Predictive Modeling
Abstract
This study develops a scalable big data analytics framework for processing and analyzing the New York City (NYC) Taxi Trip dataset using distributed computing and machine learning. The research has two objectives: to derive operational insights from large-scale transportation data and to build an accurate predictive model for total fare estimation. The dataset integrates Green Taxi and Yellow Taxi trip records containing temporal, spatial, and financial transaction attributes. Preprocessing consisted of cleaning, schema harmonization, anomaly filtering, and enrichment with taxi zone lookup information. Descriptive analytics examined demand trends, trip behavior, revenue concentration, tipping patterns, and trip efficiency. Monthly demand peaked during 2014–2016 at more than 16 million trips per month, declined gradually after 2017, and was severely disrupted in 2020 during the COVID-19 period. Taxi activity was highly concentrated in Manhattan and in the afternoon-to-evening peak hours. Revenue was dominated by a small number of pickup–dropoff borough pairs, particularly Manhattan-centered routes. Tipping was common, with 62.96% of trips including a gratuity. Trips lasting 30–60 minutes offered drivers the best balance between income opportunity and operational efficiency. For predictive analytics, a streaming batch training approach was implemented to handle more than 970 million trip records. Two incremental learning models, ElasticNet and Passive Aggressive Regressor, were evaluated using Root Mean Square Error (RMSE). Both improved substantially on the baseline model, reducing RMSE from 25.05 to 13.03 and 13.04, respectively, an error reduction of approximately 48%.
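The RMSE figures reported above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the only values taken from the abstract are the three RMSE numbers.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def relative_reduction(baseline, model):
    """Fractional drop in error relative to the baseline."""
    return (baseline - model) / baseline

# RMSE values as reported in the abstract.
BASELINE_RMSE = 25.05
ELASTICNET_RMSE = 13.03
PASSIVE_AGGRESSIVE_RMSE = 13.04

elasticnet_gain = relative_reduction(BASELINE_RMSE, ELASTICNET_RMSE)
pa_gain = relative_reduction(BASELINE_RMSE, PASSIVE_AGGRESSIVE_RMSE)
print(f"ElasticNet: {elasticnet_gain:.1%}")          # prints "ElasticNet: 48.0%"
print(f"Passive Aggressive: {pa_gain:.1%}")          # prints "Passive Aggressive: 47.9%"
```

Both reductions round to the "approximately 48%" quoted in the abstract.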
Overall, the findings demonstrate that combining big data platforms with online machine learning methods can effectively support urban mobility analysis, fare prediction, and data-driven transportation decision-making. The proposed framework is also adaptable for other smart city applications involving massive real-world datasets.
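The streaming batch training idea described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the data are synthetic, the two features (trip distance and duration), the fare formula, the hyperparameters, and the mean-fare baseline are all assumptions made for illustration, and the two small classes are hand-written stand-ins for library estimators. The point is only that each model sees one batch at a time and never holds the full dataset in memory.

```python
import math
import random

class IncrementalLinearModel:
    """Shared state for a linear model that is updated batch by batch."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict_one(self, x):
        return self.b + sum(wj * xj for wj, xj in zip(self.w, x))

class ElasticNetSGD(IncrementalLinearModel):
    """Squared-loss SGD with an elastic-net (L1 + L2) penalty."""

    def __init__(self, n_features, lr=0.05, alpha=1e-4, l1_ratio=0.5):
        super().__init__(n_features)
        self.lr, self.alpha, self.l1_ratio = lr, alpha, l1_ratio

    def partial_fit(self, X, y):
        for x, target in zip(X, y):
            err = self.predict_one(x) - target
            for j, xj in enumerate(x):
                sign = (self.w[j] > 0) - (self.w[j] < 0)  # subgradient of |w|
                penalty = self.alpha * (self.l1_ratio * sign
                                        + (1 - self.l1_ratio) * self.w[j])
                self.w[j] -= self.lr * (err * xj + penalty)
            self.b -= self.lr * err

class PassiveAggressive(IncrementalLinearModel):
    """Passive-Aggressive regressor (PA-I variant), epsilon-insensitive loss."""

    def __init__(self, n_features, C=0.1, epsilon=0.1):
        super().__init__(n_features)
        self.C, self.epsilon = C, epsilon

    def partial_fit(self, X, y):
        for x, target in zip(X, y):
            err = target - self.predict_one(x)
            loss = max(0.0, abs(err) - self.epsilon)
            if loss == 0.0:
                continue  # passive: prediction already within tolerance
            norm_sq = 1.0 + sum(xj * xj for xj in x)  # 1.0 covers the bias term
            step = min(self.C, loss / norm_sq) * (1.0 if err > 0 else -1.0)
            self.w = [wj + step * xj for wj, xj in zip(self.w, x)]
            self.b += step

def trip_batches(n_batches, batch_size, seed):
    """Yield synthetic (features, fares) batches, standing in for chunked file reads."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        X, y = [], []
        for _ in range(batch_size):
            distance = rng.uniform(0.5, 10.0)  # miles (hypothetical)
            duration = rng.uniform(2.0, 40.0)  # minutes (hypothetical)
            fare = 2.5 + 2.0 * distance + 0.5 * duration + rng.gauss(0.0, 1.0)
            X.append([distance / 10.0, duration / 40.0])  # scale to about [0, 1]
            y.append(fare)
        yield X, y

def rmse(predict, X, y):
    return math.sqrt(sum((predict(x) - t) ** 2 for x, t in zip(X, y)) / len(y))

models = {"elasticnet": ElasticNetSGD(2), "passive_aggressive": PassiveAggressive(2)}
seen, total = 0, 0.0
for X, y in trip_batches(n_batches=20, batch_size=500, seed=0):
    for model in models.values():
        model.partial_fit(X, y)  # only the current batch is in memory
    seen += len(y)
    total += sum(y)
mean_fare = total / seen  # assumed baseline: always predict the mean fare seen so far

# Evaluate on a separately generated held-out batch.
X_test, y_test = next(trip_batches(n_batches=1, batch_size=2000, seed=1))
scores = {name: rmse(m.predict_one, X_test, y_test) for name, m in models.items()}
scores["baseline_mean"] = rmse(lambda x: mean_fare, X_test, y_test)
for name, value in scores.items():
    print(f"{name}: RMSE = {value:.2f}")
```

In practice a library estimator that exposes an incremental update method would replace these classes, and the batch generator would read trip files in chunks. The numbers printed here describe only the synthetic data and say nothing about the results reported in the abstract.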
License
Copyright (c) 2026 International Journal of Information Engineering and Science

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


