About Us

What is Polar Data Insights?

JPL and USC, under the direction of Dr. Chris Mattmann, have worked to collect a corpus of “deep web” polar datasets spanning many file types containing scientific data such as images, videos, and other information on the Web. These pieces of data were collected using Apache Nutch, Apache Tika, and Apache Solr.

Our goal is to aggregate this data into an intuitive search engine that scientists can use for polar research. The data is also analyzed and illustrated with the visualization tools Banana and D3.js, giving researchers a better understanding of how the data relates within the polar ecosystem.


Search Engine

Providing researchers with a powerful tool to find relevant data sets and websites.


Illustrating data set connections and related terms to narrow searches.


Demonstrating the value of these polar data sets to the NSF, USC, and NASA.


Meetings and Conferences

4 April 2017; Arctic Science Summit Week; Prague, Czechia; presented in session "ARCTIC DATA AND INFORMATION SCIENCE MEETS SYSTEM SCIENCE"
24 July 2017; International Geoscience and Remote Sensing Symposium; Ft. Worth, USA; presented in session "Intelligence for Big Geospatial Data"
16-18 September 2017; SAON - Arctic Data Committee; Montreal, Canada
19-20 September 2017; Research Data Alliance; Montreal, Canada
4-5 October 2017; NITR Open Knowledge Network; Washington DC, USA
11-15 December 2017; Fall AGU; Washington DC, USA
8 January 2018; Semantics Symposium; Washington DC, USA
9-11 January 2018; ESIP Winter Meeting; Washington DC, USA; poster presentation
1-2 March 2018; 1st U.S. Semantic Technologies Symposium (US2TS); Wright State University, Dayton, Ohio, USA
21-23 March 2018; Research Data Alliance; Berlin, Germany
6-8 June 2018; Earthcube All Hands Meeting; Washington DC, USA
17-20 July 2018; ESIP Summer Meeting; Tucson, USA


Banana For Solr. Search Simplified

Search multiple keywords simultaneously for thousands of relevant URLs.

Add filters for more refined results using Banana's live-updating visualizations.

Go to the Banana Dashboard.


D3.js. See for yourself.

View data sets from a variety of sources to better understand polar relationships.

View some of our visualizations.


Facetview. Experience Solr.

Filter searches using facets and easily save, share, and consume documents from the Deep Web.

Go to the Facetview Dashboard.


USC Data Science Projects

Apache Sparkler Post Processing using Machine Learning

This code connects to the Solr database created for Sparkler-crawled data to perform further data extraction, classification, filtering, and insight generation using various machine learning models.

The ML models can take a keyword list from the user, extract features from URL content, classify (score) the output, and update the Solr parameter accordingly.
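The flow described above (keywords in, features out, score, Solr update) can be sketched roughly as follows. This is a minimal illustration, not the project's code: a simple keyword-coverage score stands in for the trained ML models, the helper names and the `relevance_score` field are hypothetical, and the sketch handles single-word keywords only. The final helper builds a document in Solr's atomic-update JSON form without sending it anywhere.

```python
import re
from typing import Dict, List


def keyword_features(text: str, keywords: List[str]) -> Dict[str, int]:
    """Count whole-word, case-insensitive occurrences of each keyword."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {kw: tokens.count(kw.lower()) for kw in keywords}


def relevance_score(features: Dict[str, int]) -> float:
    """Fraction of keywords found at least once (stand-in for a real model)."""
    if not features:
        return 0.0
    hits = sum(1 for count in features.values() if count > 0)
    return hits / len(features)


def solr_atomic_update(doc_id: str, score: float,
                       field: str = "relevance_score") -> dict:
    """Build an atomic-update document that sets one field on a record.

    The field name is an assumption made for this sketch.
    """
    return {"id": doc_id, field: {"set": score}}


# Example with made-up page content and keywords.
page_text = "Sea ice extent in the Arctic is shrinking; ice loss accelerates."
features = keyword_features(page_text, ["arctic", "ice", "penguin", "glacier"])
score = relevance_score(features)
update = solr_atomic_update("https://example.org/page", score)
```

A list of such documents could then be posted as JSON to the update handler of the relevant Solr collection; the collection name and schema would depend on the Sparkler setup in use.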

Polar Deep Insights

The Polar Deep Insights project is a generic content extraction and evaluation tool that can be applied to any dataset.

It is a Dockerized pipeline consisting of content extraction, enrichment, and a rich visualization interface for exploring the spatial-conceptual-temporal TREC polar dataset and documents downloaded from the ACADIS, AMD, and NSIDC websites, which were crawled using the Sparkler web crawler. We plan to use this to gain deep insights about climate change and its impact on the Arctic region.

Domain Relevant Data Collection using Google Search API

This project uses the Google Search API to produce a list of the most frequently occurring URLs for a list of domain keywords and phrases. The code first generates phrases from the provided keywords and then uses them as search queries.

After each search, the top 10 URLs (or all active, working URLs from the first page) are added to a dictionary. Once all keywords have been processed, the dictionary is sorted by frequency of occurrence.
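The counting step described above might look roughly like the sketch below. It makes several assumptions: phrases are built by pairing keywords (the project's actual phrase-generation rule is not specified here), the `search` argument is a hypothetical callable that returns result URLs for a phrase, and a canned `fake_search` replaces any real Google Search API call so the example runs offline.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Iterable, List, Tuple


def generate_phrases(keywords: Iterable[str], size: int = 2) -> List[str]:
    """Join every `size`-keyword combination into one search phrase."""
    return [" ".join(combo) for combo in combinations(keywords, size)]


def rank_urls(phrases: Iterable[str],
              search: Callable[[str], List[str]],
              top_n: int = 10) -> List[Tuple[str, int]]:
    """Count how many searches return each URL in their top results."""
    counts: Counter = Counter()
    for phrase in phrases:
        # dict.fromkeys drops duplicates within one result page while
        # keeping order, so a URL is counted once per search.
        for url in dict.fromkeys(search(phrase)[:top_n]):
            counts[url] += 1
    # most_common() returns (url, count) pairs, highest count first.
    return counts.most_common()


# Canned results standing in for a real search backend (hypothetical data).
FAKE_RESULTS = {
    "arctic ice": ["https://a.example/", "https://b.example/"],
    "arctic glacier": ["https://a.example/", "https://c.example/"],
    "ice glacier": ["https://a.example/", "https://b.example/"],
}


def fake_search(phrase: str) -> List[str]:
    return FAKE_RESULTS.get(phrase, [])


phrases = generate_phrases(["arctic", "ice", "glacier"])
ranking = rank_urls(phrases, fake_search)
```

With the canned data, `https://a.example/` appears in all three searches and ranks first. Checking that each URL is active and working, as the description mentions, would be an extra filtering step before counting.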


A web crawler is a bot program that fetches resources from the web in order to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval by combining various Apache projects such as Spark, Kafka, Lucene/Solr, and Tika, along with pf4j.

Sparkler is an extensible, highly scalable, high-performance web crawler that is an evolution of Apache Nutch and runs on an Apache Spark cluster.

PDI Topics

LDA topic modeling for Polar Deep Insights.

Polar Domain Discovery

Domain Discovery on Polar Domain


This is a FacetView setup for crawled ocean observation data.


Chris Mattmann

Wayne M Burke

Ruth Duerr

Siri Jodha Singh Khalsa

Simin Ahmadi Karvigh

Omid Davtalab

Thamme Gowda

Nithin Krishna Ottilingam

Karanjeet Singh

Madhav Sharan

Srinidhi Nandakumar

Prerana Teligi Harapanahalli Math

Dixita Patel

USC Data Science Partner Sites

TREC/Data Description

The goal of the Text Retrieval Conference (TREC) is to encourage research in information retrieval from large text collections by providing interesting and understudied domains of documents to crawl.

Currently, the polar domain contains the NSF-funded Advanced Cooperative Arctic Data and Information System (ACADIS), the NASA-funded Antarctic Master Directory (AMD), and the National Snow and Ice Data Center (NSIDC) Arctic Data Explorer. Our data was retrieved using these directories and submitted to TREC in 2015.

Polar Hack - November 2014

Hosted by the NSF, this hackathon aimed to implement visualizations of existing polar data sets to support new discoveries and promote cross-agency collaboration between the NSF, NASA, NOAA, and other Arctic- and polar-related agencies.

Ultimately, the workshop fostered the understanding of the variability of the polar regions at different timescales, allowing the NSF to make longer-term investments in technologies and visualizations that can be adopted by the community.


The mission of the Information Retrieval and Data Science (I.R.D.S.) Group is to research and develop new methodologies and open-source software to analyze, ingest, process, and manage Big Data and to turn it into information.

We have expertise in data collection and contribute to the world's largest and most often downloaded open-source projects, working with NASA, DARPA, DHS, and NIH across a number of domains, including Earth science, planetary science, astronomy, defense, and private industry.


Dr. Chris Mattmann - Visit his website

CS401 Group (Lorraine Sposto, Jonathan Luu, Ruthvik Peddawandla, Titus Jung, Janet Kim)

CS599 Spring 2016 Class - Visit the class website

CS572 Spring 2015 Class - Visit the class website