learning
beginner
Completed

(In)Direct from the Source

$2,000
Completed 114 weeks ago

Goal of Challenge

We want contestants to walk away from this challenge having learned how to use Python to work with geoscience data. The tools are simple to understand and research, but applying them to real data raises practical issues, and working through those issues provides a foundation for further learning and growth.

The tools/methods that will be used in the challenge are:

  • Cleaning and normalizing a table of data 

  • Creating a synthetic training set based on sampling real data

  • Building a simple ML model

Other tools are featured in this challenge, but they are secondary to completing the challenge. 

Background

Geochemistry data is a powerful component in understanding prospectivity within a basin. But the data is also messy: it is fraught with sampling errors, inconsistent processing methodologies, and the full range of challenges that come with data about Mother Earth.

Contestants will be given a table of data from the imaginary Pluto Basin and asked to identify each oil sample's origins. To solve this challenge, contestants will also need to integrate different pieces of geological information as well as hone their data science skills.

Data

The contestants will be given a series of data points to solve the challenge:

  • Geochemistry samples from the Pluto Basin

  • Global ranges for known oil families

  • "Kitchens" map for major source rocks

  • Notes from a Geochemist

Geologic Premise

You’ve just been assigned to explore for hydrocarbons in the Pluto Basin. The Pluto Basin has several discoveries, but many dry holes. To your astonishment, none of the legacy work states the source for this oil. From your experience, you know that by understanding the location, maturity, and type of source rocks in a basin, you can be more predictive about what works and what fails within the basin. Therefore, you first decide to interpret oil families for the basin. Oil family analysis is done by looking at chromatograph trace data of oil samples and matching it to known types of oils (e.g., marine vs. shallow lagoon). Each oil family has its own fingerprint. By plotting these oil families on a map, a geoscientist can then draw conclusions about how hydrocarbons are migrating through the system.

As you just started on this project, you only have a few pieces of information:

  • Published ranges for different oil family types

  • ~80 oil samples

  • Megasequence cross section through the Pluto Basin

  • Map of samples and key geologic features

Getting Started

The Starter Notebook will walk you through the process of building an ML model in Python.

The contestants will have to go through several steps in order to predict oil families: 

  1. Inspect and clean the geochemistry samples

  2. Generate synthetic data

  3. Build an ML model

  4. Quality check output

  5. Check the answer with Xeek

Let’s step through these individually.

Cleaning

The oil chromatograph data table is messy. Its issues include 0 values randomly dispersed throughout the dataset, numbers on different scales, and spikes in the data (due to oil-based mud). These issues are all called out in “Notes from a Geochemist.” How to solve these problems is up to each contestant.
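
As one possible starting point, the three issues above could be handled as in the sketch below. This is not the Starter Notebook's method. The column names (`peak_a`, `peak_b`, `peak_c`), the sample values, and the percentile thresholds are all hypothetical, and the sketch assumes that 0 means "missing" and that scale differences occur between samples (rows).

```python
import numpy as np
import pandas as pd


def clean_samples(df, lower_q=0.05, upper_q=0.95):
    """Return a cleaned copy of a table of numeric chromatograph columns."""
    # Assumption: a 0 is a missing measurement, not a true zero.
    out = df.astype(float).replace(0.0, np.nan)

    # Put every sample on a common scale: each row is divided by its own
    # total (missing values are skipped when summing).
    out = out.div(out.sum(axis=1), axis=0)

    # Cap spikes by clipping each column to its own percentile range.
    out = out.clip(lower=out.quantile(lower_q), upper=out.quantile(upper_q), axis=1)

    # Fill the remaining gaps with the column median.
    return out.fillna(out.median())


# Hypothetical raw table: zeros, one sample on a 100x scale, one spike.
raw = pd.DataFrame({
    "peak_a": [12.0, 0.0, 1500.0, 9.0, 11.0],
    "peak_b": [30.0, 28.0, 2900.0, 0.0, 400.0],
    "peak_c": [58.0, 61.0, 5600.0, 55.0, 60.0],
})
cleaned = clean_samples(raw)
print(cleaned.round(3))
```

Other choices are just as defensible: dropping samples with too many zeros, log-transforming before clipping, or removing spikes peak by peak based on the geochemist's notes.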

Generate Synthetic Data

The data table has no labels and only a few samples, not enough to train an ML model. The contestants will have to use the ranges from global oil families to build a synthetic dataset. They are given the distributions for the chromatograph results for five such families (see Starter Notebook files).
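
A minimal sketch of this step is shown below. It assumes the published distributions can be summarized as a mean and standard deviation per chromatograph peak and sampled as independent normal distributions. The family names, peak names, and numbers are placeholders, not values from the Starter Notebook files, and only two of the five families are shown to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical (mean, std) per peak for each oil family.
FAMILY_STATS = {
    "family_1": {"peak_a": (0.10, 0.02), "peak_b": (0.30, 0.03), "peak_c": (0.60, 0.04)},
    "family_2": {"peak_a": (0.25, 0.03), "peak_b": (0.45, 0.04), "peak_c": (0.30, 0.03)},
}


def make_synthetic(stats, n_per_family=200, seed=0):
    """Draw labeled synthetic samples from per-family normal distributions."""
    rng = np.random.default_rng(seed)
    frames = []
    for family, peaks in stats.items():
        data = {
            peak: rng.normal(mean, std, n_per_family)
            for peak, (mean, std) in peaks.items()
        }
        # Abundances cannot be negative, so clip the rare negative draw.
        frame = pd.DataFrame(data).clip(lower=0)
        frame["family"] = family
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


synthetic = make_synthetic(FAMILY_STATS)
print(synthetic.groupby("family").mean().round(3))
```

If the real distributions are skewed or correlated between peaks, independent normal draws will be too optimistic, so it is worth comparing the synthetic samples against the cleaned challenge data before training.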

Build ML Model

Once synthetic data has been generated, an ML model can be trained to predict oil families. First, the model must successfully predict a blind test set held out from the synthetic data. Once the model performs satisfactorily, it can be run on the challenge data to predict a label for each sample.
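
The train-then-blind-test workflow might look like the sketch below. The random forest from scikit-learn is just one reasonable choice of classifier, not necessarily the one used in the Starter Notebook. The five family centers and the four features are made-up stand-ins for a real synthetic dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in synthetic data: five families, four features, 200 samples each.
# The centers are hypothetical and only serve to make the example runnable.
rng = np.random.default_rng(0)
centers = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.10, 0.40, 0.10, 0.40],
    [0.40, 0.10, 0.40, 0.10],
])
X = np.vstack([rng.normal(c, 0.03, size=(200, 4)) for c in centers])
y = np.repeat(np.arange(5), 200)

# Hold out a blind test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

blind_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Blind-test accuracy on synthetic data: {blind_accuracy:.3f}")

# Once the blind-test score is acceptable, the same call is used on the
# cleaned challenge samples, e.g. model.predict(cleaned_challenge_features).
```

A high score here only shows that the model separates the synthetic families. It says nothing yet about how well the synthetic data resembles the real samples.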

Quality Control Output

The input data and the synthetic training data are imperfect, so it shouldn't be a surprise that the first round of predictions is also imperfect. Luckily, we can integrate other pieces of geologic information to get close to a perfect answer.
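
One way to fold geologic information into the predictions is sketched below. It assumes the model can output a probability per family for each sample, and that the contestant has built a yes/no table of which families are geologically plausible at each sample location (for instance from the "Kitchens" map). Both the function and the plausibility table are hypothetical illustrations, not part of the challenge materials.

```python
import numpy as np


def apply_geologic_mask(proba, plausible):
    """Pick the most probable family among those judged geologically plausible.

    proba:     (n_samples, n_families) predicted probabilities
    plausible: (n_samples, n_families) 1 where a family is plausible, else 0
    """
    masked = proba * plausible
    totals = masked.sum(axis=1, keepdims=True)
    # If every family was ruled out for a sample, keep the model's own ranking.
    masked = np.where(totals > 0, masked, proba)
    return masked.argmax(axis=1)


proba = np.array([
    [0.6, 0.3, 0.1],   # model prefers family 0
    [0.2, 0.5, 0.3],   # model prefers family 1
    [0.7, 0.2, 0.1],   # model prefers family 0
])
plausible = np.array([
    [0, 1, 1],         # family 0 ruled out at this location
    [1, 1, 1],         # no constraint
    [0, 0, 0],         # nothing marked plausible: fall back to the model
])
labels = apply_geologic_mask(proba, plausible)
print(labels)
```

Samples where the mask changes the answer are good candidates for a manual look at the map and cross section.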

Check Answer on Xeek

Contestants can submit their predictions as CSV files through the Xeek website to get a score. This score shows how close their predictions are to the actual (fictional) answer.
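
Writing the predictions to CSV with pandas could look like the sketch below. The column names `sample_id` and `family`, the sample IDs, and the labels are assumptions made for illustration. The required submission layout is defined by the challenge itself, so check it before submitting.

```python
import pandas as pd

# Hypothetical predictions: one row per challenge sample.
submission = pd.DataFrame({
    "sample_id": ["S001", "S002", "S003"],
    "family": ["family_1", "family_3", "family_1"],
})

# With no path given, to_csv returns the CSV text; pass a filename such as
# "submission.csv" instead to write the file to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```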