Proceedings of the International Joint Conference on Science and Engineering (IJCSE 2020)

The Design and Implementation of Web Crawler Distributed News Domain Detection System

Authors
I Gusti Lanang Putra Eka Prismana, Dedy Rahman Prehanto, I Kadek Dwi Nuryana
Corresponding Author
I Gusti Lanang Putra Eka Prismana
Available Online 24 November 2020.
DOI
10.2991/aer.k.201124.017
Keywords
Web crawler, news domain, distributed, focus crawler
Abstract

Distributing data over the internet to improve a business's chances of success through market-trend analysis is now common. Web crawling is important in this setting because it helps ensure that the collected data is complete and up to date. A web crawler is a program that downloads web pages automatically. Crawlers and search engines face unpredictable challenges. A focused crawl is essential for mining the vast amount of data available on the internet, yet web crawls suffer from unpredictable latency because websites differ in response time. This research aims to optimize the design and implementation of a distributed news domain detection system built on a web crawler. The study proposes a distributed focused crawler because it reduces timeouts on each website, eliminates backlist capabilities, distributes resources, and makes more efficient use of network bandwidth and storage capacity. The main objective of the distributed web crawler is to implement crawler scheduling, sorting sites to define the URL queues. The crawler focuses only on news data. This research implements three components: URL Gate explorer, which serves as the main bridge for instructions from the database; URL Seed, which checks all URLs for each news item; and get metadata, which checks each item's metadata to see whether the same title already exists.
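
The scheduling and duplicate-title check described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `NewsFrontier`, its methods, the round-robin policy, and the title normalization are all assumptions made for illustration, and no network access is performed.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class NewsFrontier:
    """Hypothetical URL frontier: one queue per site, served round-robin."""

    def __init__(self):
        self._queues = OrderedDict()  # host -> deque of pending URLs
        self._seen_urls = set()
        self._seen_titles = set()

    def add_url(self, url):
        """Queue a URL under its host; return False if it was already added."""
        if url in self._seen_urls:
            return False
        self._seen_urls.add(url)
        host = urlsplit(url).netloc
        self._queues.setdefault(host, deque()).append(url)
        return True

    def next_url(self):
        """Return the next URL, rotating hosts so one slow site cannot stall the rest."""
        if not self._queues:
            return None
        host, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        if queue:
            self._queues.move_to_end(host)  # send this host to the back of the rotation
        else:
            del self._queues[host]  # drop hosts with nothing left to fetch
        return url

    def is_new_title(self, title):
        """Record a normalized title; return False if the same title was seen before."""
        key = " ".join(title.lower().split())  # assumed normalization: case and whitespace
        if key in self._seen_titles:
            return False
        self._seen_titles.add(key)
        return True


frontier = NewsFrontier()
for u in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
    frontier.add_url(u)
order = [frontier.next_url() for _ in range(3)]
```

Interleaving hosts in this way is one simple means of limiting the effect of a single slow site on the whole crawl; a distributed deployment would additionally need to partition the per-host queues across workers, which this sketch leaves out.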

Copyright
© 2020, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the International Joint Conference on Science and Engineering (IJCSE 2020)
Series
Advances in Engineering Research
Publication Date
24 November 2020
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - I Gusti Lanang Putra Eka Prismana
AU  - Dedy Rahman Prehanto
AU  - I Kadek Dwi Nuryana
PY  - 2020
DA  - 2020/11/24
TI  - The Design and Implementation of Web Crawler Distributed News Domain Detection System
BT  - Proceedings of the International Joint Conference on Science and Engineering (IJCSE 2020)
PB  - Atlantis Press
SP  - 92
EP  - 97
SN  - 2352-5401
UR  - https://doi.org/10.2991/aer.k.201124.017
DO  - 10.2991/aer.k.201124.017
ID  - Prismana2020
ER  -