A Web Scraper for Data Mining Purposes

Yasir Ali Mahmood, Bassim Mahmood

Abstract


The current revolution in technology has made data a crucial part of real-world applications because of its importance in decision-making. In the era of big data, with data streams expanding massively across Internet networks and platforms, collecting, mining, and analyzing data has become difficult. Auxiliary applications for data mining and gathering have therefore become a necessity. Companies usually offer dedicated APIs for collecting data from particular sources, but these often come at a high cost. There is also a severe lack in the literature of approaches that offer flexible, low-cost, or free tools for web scraping. Hence, this article provides a free tool for mining and collecting data from the web; specifically, it introduces an efficient Google Scholar web scraper. The extracted data can be analyzed to support decisions about an issue of interest. The proposed scraper can also be modified to crawl web links and retrieve specific data from a particular website, and it can format the collected data as a ready-to-use dataset for the analysis phase. The efficiency of the proposed scraper is tested in terms of time consumption, accuracy, and the quality of the collected data. The findings show that the proposed approach is highly feasible for data collection and can be adopted by data analysts.
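As a rough illustration of the extract-then-format idea described in the abstract, the sketch below parses a small HTML fragment into rows and writes them out as CSV. It is not the authors' implementation: the markup (`result`, `title`, `authors` class names), the sample data, and the names `ResultParser`, `scrape`, and `to_csv` are all hypothetical, and real Google Scholar pages are structured differently and may restrict automated access.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup, invented for this sketch: each result is a
# <div class="result"> holding an <h3 class="title"> and a <div class="authors">.
SAMPLE_HTML = """
<div class="result">
  <h3 class="title"><a href="https://example.org/a">Paper A</a></h3>
  <div class="authors">X. One, Y. Two - 2021</div>
</div>
<div class="result">
  <h3 class="title"><a href="https://example.org/b">Paper B</a></h3>
  <div class="authors">Z. Three - 2022</div>
</div>
"""


class ResultParser(HTMLParser):
    """Collects one dict per result block from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # row being filled
        self._field = None    # field whose text is being captured

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "div" and "result" in classes:
            self._current = {"title": "", "authors": "", "link": ""}
            self.rows.append(self._current)
        elif self._current is not None:
            if tag == "h3" and "title" in classes:
                self._field = "title"
            elif tag == "div" and "authors" in classes:
                self._field = "authors"
            elif tag == "a" and self._field == "title":
                self._current["link"] = attr.get("href") or ""

    def handle_endtag(self, tag):
        # Stop capturing text once the enclosing element closes.
        if tag in ("h3", "div"):
            self._field = None

    def handle_data(self, data):
        if self._field is not None and self._current is not None:
            self._current[self._field] += data


def scrape(html):
    """Return a list of cleaned rows extracted from an HTML string."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in row.items()} for row in parser.rows]


def to_csv(rows):
    """Serialize rows as CSV text, ready for an analysis tool."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["title", "authors", "link"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = scrape(SAMPLE_HTML)
print(to_csv(rows))
```

Working from a stored HTML string keeps the sketch free of network access; in practice the fetching step, request pacing, and the selectors would all need to be adapted to the target site and its terms of use.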





DOI: https://doi.org/10.32520/stmsi.v13i3.4107



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.