Web Crawler & Search Engine
Overview
As part of my coursework at UCI, I undertook a comprehensive project to develop a web crawler and a search engine. This project was divided into two main parts: building a web crawler to collect data from specified UCI domains and implementing a search engine to index and retrieve relevant information from the collected data.
Part 1: Web Crawler Development
Objective: Implement a web crawler to gather data from specified UCI domains:
- *.ics.uci.edu/*
- *.cs.uci.edu/*
- *.informatics.uci.edu/*
- *.stat.uci.edu/*
Key Tasks:
- Crawler Implementation: Installed the necessary libraries and configured the crawler, then implemented a scraper function to parse web responses, extract content, and return valid URLs (a sketch follows this list).
- Crawler Behavior: Ensured politeness by honoring the crawl delay between requests, avoided web traps, handled redirects and dead URLs, and converted relative URLs to absolute ones.
- Monitoring and Maintenance: Monitored the crawler to ensure proper functionality, addressed issues, and restarted as needed.
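Below is a minimal sketch of the link-extraction step in spirit. It assumes a BeautifulSoup/lxml parse and uses urllib.parse to convert relative links into absolute, defragmented URLs before filtering them against the allowed domains; the function names and domain check are illustrative, not the exact assignment interface.

```python
from urllib.parse import urljoin, urldefrag, urlparse
from bs4 import BeautifulSoup

# Domains the crawler is allowed to stay inside (from the assignment spec).
ALLOWED_DOMAINS = (".ics.uci.edu", ".cs.uci.edu", ".informatics.uci.edu", ".stat.uci.edu")

def extract_next_links(url, page_content):
    """Parse a fetched page and return absolute, defragmented, in-scope URLs."""
    soup = BeautifulSoup(page_content, "lxml")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(url, anchor["href"])   # relative -> absolute
        absolute, _frag = urldefrag(absolute)     # ignore #fragments
        if is_valid(absolute):
            links.append(absolute)
    return links

def is_valid(url):
    """Keep only http(s) URLs whose host falls under an allowed UCI domain."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.netloc.lower().split(":")[0]    # drop any port
    return any(host == d.lstrip(".") or host.endswith(d) for d in ALLOWED_DOMAINS)
```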
Analytical Deliverables (a computation sketch follows this list):
- Unique Pages: Counted unique pages by URL, ignoring fragments.
- Longest Page: Identified the longest page by word count, excluding HTML markup.
- Common Words: Listed the 50 most frequent words, excluding stop words.
- Subdomains: Counted and listed the subdomains within ics.uci.edu.
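A rough sketch of how these statistics can be gathered during the crawl is shown below. It assumes each fetched page arrives as a (url, html) pair and relies on collections.Counter; the stop-word set shown is a truncated stand-in for the full list.

```python
import re
from collections import Counter
from urllib.parse import urldefrag, urlparse
from bs4 import BeautifulSoup

STOP_WORDS = {"the", "and", "of", "to", "in", "a"}   # truncated stand-in list

unique_pages = set()          # URLs with fragments removed
word_counts = Counter()       # token -> frequency across all pages
longest_page = ("", 0)        # (url, word count)
ics_subdomains = Counter()    # host -> pages seen under ics.uci.edu

def record_page(url, html):
    global longest_page
    clean_url, _ = urldefrag(url)
    unique_pages.add(clean_url)

    text = BeautifulSoup(html, "lxml").get_text()
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) > longest_page[1]:
        longest_page = (clean_url, len(words))
    word_counts.update(w for w in words if w not in STOP_WORDS)

    host = urlparse(clean_url).netloc.lower()
    if host.endswith(".ics.uci.edu"):
        ics_subdomains[host] += 1

# After the crawl: len(unique_pages), longest_page, word_counts.most_common(50)
```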
Tools and Technologies:
- Libraries: BeautifulSoup, lxml, nltk, pandas, urllib.parse, collections, unidecode, numpy, langchain, configparser, argparse, threading
- Development Environment: UCI openlab machines and VPN for off-campus access.
Part 2: Search Engine Development
Objective: Implement a search engine to index and retrieve information from the data collected by the web crawler.
Milestones:
- Milestone 1: Build an Index
- Goal: Create an inverted index for the corpus.
- Tasks: Design the structure of the postings, calculate term frequencies, and modularize the indexing process (see the sketch below).
- Deliverables: Submit code and a report detailing the number of documents, unique tokens, and the total size of the index.
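A minimal sketch of the Milestone 1 index build is shown below. It assumes documents arrive as (doc_id, text) pairs and uses nltk's word_tokenize; the Posting layout is one reasonable choice rather than the exact structure submitted.

```python
from collections import defaultdict
from dataclasses import dataclass
from nltk.tokenize import word_tokenize   # requires nltk.download("punkt")

@dataclass
class Posting:
    doc_id: int
    term_freq: int   # raw count of the term in this document

def build_index(documents):
    """documents: iterable of (doc_id, text) pairs -> {token: [Posting, ...]}"""
    index = defaultdict(list)
    for doc_id, text in documents:
        counts = defaultdict(int)
        for token in word_tokenize(text.lower()):
            if token.isalnum():
                counts[token] += 1
        for token, tf in counts.items():
            index[token].append(Posting(doc_id, tf))
    return index
```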
- Milestone 2: Develop a Search Component
- Goal: Implement a basic search functionality.
- Tasks: Handle boolean queries, sort retrieved documents using tf-idf scoring, and perform cosine similarity calculations (a ranking sketch follows this milestone).
- Deliverables: Submit code and a report with the top 5 URLs for specified queries and a picture of the search interface in action.
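The ranking step can be sketched roughly as follows, assuming the index from Milestone 1 (token mapped to a list of Postings) and a known corpus size N. The boolean AND filter, the log-scaled tf-idf weights, and the top-5 cutoff are illustrative choices rather than the exact weighting scheme used.

```python
import math
from collections import defaultdict

def rank(query_tokens, index, N, k=5):
    """Boolean AND retrieval followed by tf-idf scoring with cosine normalization."""
    # Keep only documents that contain every query term.
    doc_sets = [{p.doc_id for p in index.get(t, [])} for t in query_tokens]
    if not doc_sets:
        return []
    candidates = set.intersection(*doc_sets)

    scores = defaultdict(float)   # accumulated dot product with the query
    norms = defaultdict(float)    # squared vector length over the query terms
    for t in set(query_tokens):
        postings = index.get(t, [])
        idf = math.log(N / (1 + len(postings)))
        for p in postings:
            if p.doc_id in candidates:
                w = (1 + math.log(p.term_freq)) * idf   # document-side weight
                scores[p.doc_id] += w                   # query weight treated as 1
                norms[p.doc_id] += w * w

    # Approximate cosine similarity: normalize by the query-term vector length.
    ranked = [(doc, s / math.sqrt(norms[doc])) for doc, s in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:k]
```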
- Milestone 3: Complete Search Engine
- Goal: Improve and finalize the search engine.
- Tasks: Evaluate the search engine with a set of queries, optimize performance, and generalize heuristics.
- Deliverables: Submit all programs, a README file, and a TEST file. Conduct a live demonstration for the TAs.
Features:
- Indexing: Built an inverted index using term frequencies and tf-idf scores, and gave extra weight to important text drawn from HTML tags (see the tag-weighting sketch after this list).
- Search Interface: Implemented a console prompt for user queries. Utilized tf-idf for ranking, with additional components for improved retrieval.
- Extra Credit: Added features such as duplicate page detection, PageRank, n-gram indexing, word position indexing, and a GUI for enhanced user experience.
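As an illustration of the important-text weighting, the sketch below assigns a larger multiplier to tokens that appear inside titles, headings, and bold tags; the specific tags and boost values are assumptions, not the exact weights used.

```python
from bs4 import BeautifulSoup

# Assumed boost factors for "important" HTML tags (illustrative values).
IMPORTANT_TAGS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0, "b": 1.5, "strong": 1.5}

def important_token_weights(html):
    """Return {token: boost}, giving each token the largest boost of any tag it appears in."""
    soup = BeautifulSoup(html, "lxml")
    weights = {}
    for tag, boost in IMPORTANT_TAGS.items():
        for element in soup.find_all(tag):
            for token in element.get_text().lower().split():
                weights[token] = max(weights.get(token, 1.0), boost)
    return weights

# During indexing, a posting's term frequency can be multiplied by this boost.
```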
Tools and Technologies:
- Languages and Libraries: Python, nltk, pandas, BeautifulSoup, urllib.parse, collections, unidecode, numpy, nicegui, langchain.
- Development Environment: Utilized GitHub for version control and collaboration.
Challenges and Solutions:
- Efficient Data Structures: Designed data structures to handle large datasets efficiently.
- Optimized Indexing: Implemented disk-based partial indexing to manage memory usage (a sketch follows this list).
- Responsive Search: Achieved fast search response times under specified constraints.
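The partial-indexing idea can be sketched as follows: flush the in-memory index to a sorted run on disk every few thousand documents, then stream-merge the runs into the final index. The file names, the JSON-lines format, the flush threshold, and the assumption that postings are plain JSON-serializable lists (e.g. [doc_id, tf] pairs) are all illustrative choices.

```python
import heapq
import json

FLUSH_EVERY = 10_000   # assumed number of documents per in-memory partial index

def flush_partial(index, run_id):
    """Write one sorted partial index to disk and clear it from memory."""
    path = f"partial_{run_id}.jsonl"
    with open(path, "w") as f:
        for token in sorted(index):
            # Postings are assumed to be plain lists such as [doc_id, tf].
            f.write(json.dumps({"token": token, "postings": index[token]}) + "\n")
    index.clear()
    return path

def merge_partials(paths, out_path):
    """Stream-merge sorted runs so the full index never sits in memory at once."""
    files = [open(p) for p in paths]
    streams = [(json.loads(line) for line in f) for f in files]
    with open(out_path, "w") as out:
        current, postings = None, []
        for entry in heapq.merge(*streams, key=lambda e: e["token"]):
            if entry["token"] != current:
                if current is not None:
                    out.write(json.dumps({"token": current, "postings": postings}) + "\n")
                current, postings = entry["token"], []
            postings.extend(entry["postings"])
        if current is not None:
            out.write(json.dumps({"token": current, "postings": postings}) + "\n")
    for f in files:
        f.close()
```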
Results:
- Successfully crawled and indexed a large collection of web pages.
- Developed a functional search engine with efficient retrieval and ranking mechanisms.
- Conducted a comprehensive evaluation and optimization of the search engine.
This project honed my skills in web scraping, data extraction, and search engine implementation, and gave me valuable experience handling large-scale data while keeping retrieval fast and accurate.