Comprehensive Guide to Elasticsearch Indexing and Retrieval

Learn how to index, retrieve, and preprocess content with Elasticsearch. Explore techniques such as crawling with Heritrix, accessing Kibana, defining text preprocessing, testing Lucene analyzers, using file system (FS) crawler for indexing, and configuring FS crawler for efficient data ingestion into Elasticsearch.


Uploaded on Dec 14, 2024



Presentation Transcript


  1. Indexing with Elasticsearch Agnese Chiatti and Lee Giles

  2. Recap
  - Crawling the necessary contents with Heritrix
  - Dumping the results in mirror format for HTML pages
  - Retrieving the stored HTML pages from the file system
  - Indexing the crawled contents into Elasticsearch

  3. Accessing Kibana
  - Open a browser (from within VLabs or an IST-network wired computer)
  - Go to ist441giles.ist.psu.edu:56+<teamno>, e.g., ist441giles.ist.psu.edu:5601 for team 1, ist441giles.ist.psu.edu:5602 for team 2, and so forth

  4. Defining Text Preprocessing How do we create a new index that filters out HTML tags and extracts/indexes just the text?
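One way to do this, sketched below for the Kibana Dev Tools console, is to define a custom analyzer that applies the html_strip char filter at index-creation time. The analyzer name html_text and the content field mapping are illustrative choices, not prescribed by the slides; the body follows Elasticsearch 6.1 conventions (single mapping type).

```
PUT /ist441-test-1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": { "type": "text", "analyzer": "html_text" }
      }
    }
  }
}
```

Any document indexed into the content field will then have HTML markup stripped before tokenization.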

  5. Testing your Lucene analyzer Test it on some sample text before indexing everything! See the Elasticsearch 6.1 references for tokenizers and filters: the HTML strip char filter and the Lucene tokenizers.
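The _analyze API lets you run a char filter, tokenizer, and token filters over sample text without indexing anything; the sample text below is illustrative. From the Kibana Dev Tools console:

```
POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>Indexing with <b>Elasticsearch</b></p>"
}
```

This should return the tokens indexing, with, and elasticsearch, with the HTML markup removed, confirming the analysis chain behaves as intended.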

  6. File System (FS) crawler (actually an indexer!) We will be using a plugin to ingest data into Elasticsearch. It is actually an indexing system, so do not be misled by the name!
  1. Access the class server through PuTTY
  2. cd /data/ist441/<teamno>/data-ingest/fscrawler-2.4
  3. screen
  4. bin/fscrawler ist441-test-1

  7. Configuring FS crawler
  5. vim ~/.fscrawler/ist441-test-1/_settings.json and press i to start editing
  6. url should point to the mirror folder where your crawled content is stored, for example: /data/ist441/team1/crawler/heritrix-3.3.0-SNAPSHOT/jobs/test1/latest/mirror

  8. Configuring FS crawler (2)
  7. The update rate can be changed from minutes to days
  8. Change the elasticsearch host to ist441giles.ist.psu.edu
  9. Change the elasticsearch port to 92+<team-number>
  10. At line 31, add index : ist441-test-1

  9. Configuring FS crawler (3)
  11. The default HTTP port should be changed from 8080 to 808+<team-number>, e.g., 8081 for team 1, 8090 if you are team 10, 8091 for team 11, and so forth
  12. When you are done with the modifications, press Esc, then type :wq
  13. Run bin/fscrawler ist441-test-1 again
  14. Press Ctrl-a then d to detach from the screen session
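Putting the configuration steps together, the edited _settings.json might look roughly like the sketch below (shown with team 1's values; it is based on the general shape of FSCrawler 2.4's generated defaults, so field names and extra defaults in your generated file may differ slightly, and you should edit that file rather than copying this verbatim):

```
{
  "name" : "ist441-test-1",
  "fs" : {
    "url" : "/data/ist441/team1/crawler/heritrix-3.3.0-SNAPSHOT/jobs/test1/latest/mirror",
    "update_rate" : "1d"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "ist441giles.ist.psu.edu",
      "port" : 9201,
      "scheme" : "HTTP"
    } ],
    "index" : "ist441-test-1"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8081
  }
}
```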

  10. Getting started with queries
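Once FSCrawler has indexed some pages, a minimal first query can be run from the Kibana Dev Tools console. The sketch below assumes the extracted text lives in a content field (where FSCrawler typically stores it); the search terms are illustrative.

```
GET /ist441-test-1/_search
{
  "query": {
    "match": { "content": "information retrieval" }
  }
}
```

The response lists matching documents ranked by relevance score, which is a good sanity check that crawling, ingestion, and analysis all worked end to end.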
