Tags: supervised learning, recurrent networks, well logs, CNN, lithology, NPD, machine learning

FORCE: Machine Predicted Lithology


This contest is developed in collaboration with FORCE.

The objective of the competition is to correctly predict lithology labels for the provided well logs, using the provided NPD lithostratigraphy and the wells' X, Y positions.

Several software vendors offer the competition data with free licenses to their packages for people less familiar with Python. For the well log machine learning data you can use https://www.i2g.cloud/. https://www.dgbes.com/ offers a pre-loaded project with a full Python development environment that supports any machine learning model, so you don't have to worry about data IO and data visualization; they also offer free licenses to all their specialized plugins. https://www.cegal.com/ offers temporary free licenses to their Petrel Python plugin for the duration of the competition. Please contact support.geo@cegal.com if you would like to test a license.


The Leaderboard will go live September 1st 2020.

All questions and discussion related to the challenge should be submitted to the challenge specific discussion forum (under the "Discussion Tab").

Competition Details

  1. Prizes are decided by the FORCE organizers and could include travel stipends or cash prizes. See the FORCE site for additional information.

  2. The objective of the competition is to correctly predict lithology labels using provided well logs, provided NPD Lithostratigraphy and well location X,Y position.

  3. The competition is scored using a penalty matrix. Some label mistakes are penalized more than others, see starter notebook and penalty matrix for details.

  4. Starting September 1st, teams are invited to submit predictions on the open test data in the form of CSV files to enter a weekly leaderboard. The top ten teams on the leaderboard at the end of the competition are invited to submit code for final scoring. Submissions are limited to one submission per day per team or individual participant.

  5. For final scoring, teams are required to submit a pre-trained model with Python code for predicting lithology labels on data in the same form as the open test data. The submission must contain a list of required dependencies (with version numbers) and any needed files. 

  6. The final ranking will be based on the score against the closed test dataset and the final rank will not necessarily be the same as the leaderboard ranking. The team with the highest value of the scoring function evaluated on the closed test dataset is the winner. The organizers reserve the right to disqualify submissions that cannot be easily run.

  7. The top 10 teams participating in the final scoring are required to publish their model and training code under an Apache 2.0 license, and make it publicly accessible.

  8. The predictions in the final scoring must be released under a CC-BY-4.0 license by all teams.

  9. For this Challenge, Participants retain ownership over their submissions, but are expected to comply with the citation and release licensing detailed above.

  10. To register as a team, contact support@xeek.ai and provide the email addresses of each team member. Teams must conform to the submission requirements (one prediction submission per team per day) and all team members must individually register for the challenge.

  11. Teams are limited to 4 participants or fewer.

Background Information

Previous work and background information on lithofacies prediction from well logs can be found in the references of the previous SEG machine learning competition (Hall and Hall, 2017) and the follow-up work that cites it.

Here is the link to the GitHub repository.

Here is the link to the Hall 2017 publication.

Description of the Datasets

The provided dataset contains well logs, interpreted lithofacies and lithostratigraphy for 90+ released wells from offshore Norway. The well logs include the well name (WELL), the measured depth, the x, y, z location of the wireline measurement, as well as the log curves CALI, RDEP, RHOB, DRHO, SGR, GR, RMED, RMIC, NPHI, PEF, RSHA, DTC, SP, BS, ROP, DTS, DCAL, MUDWEIGHT. An explanation of the abbreviations is shown in the figure below.

(Figure: well log abbreviation key)

As in real life, some (or many) logs will be missing at a given depth. Part of the competition is being able to deal with this in a machine learning setting. Missing logs are marked as NaN. The only features guaranteed to be present are DEPT and GR. Any model needs to handle missing columns as part of the prediction pipeline; the test data will also have missing columns. You can read more about the particular wells on the NPD factpages.
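Since the test data can be missing entire columns, below is a minimal sketch of padding a well's table to a fixed schema before prediction. It assumes pandas; the feature list and the function name ensure_columns are illustrative, not part of the challenge material.

```python
import numpy as np
import pandas as pd

# Illustrative feature list -- replace with the curves your model actually uses.
EXPECTED_LOGS = ["GR", "RHOB", "NPHI", "DTC", "RDEP", "PEF"]

def ensure_columns(df: pd.DataFrame, columns=EXPECTED_LOGS) -> pd.DataFrame:
    """Add any log columns absent in this well as all-NaN columns,
    so the prediction pipeline always sees the same schema."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            out[col] = np.nan
    return out
```

The NaNs still have to be handled downstream, for example by imputation or by a model that tolerates missing values.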

Training Data

The training data consists of well logs from 98 wells. For each log depth, the interpreted lithofacies class is given as an integer in the column FORCE_2020_LITHOFACIES_LITHOLOGY. An extra column named FORCE_2020_LITHOFACIES_CONFIDENCE (1: high, 2: medium, 3: low), assigned by the interpreter, is also included. Teams may choose to use it for sample importance weighting during training, or ignore it.

Open Test Data

10 wells are included in an open dataset and will be used for the leaderboard. The open dataset contains only the well logs, not the interpretations.

Closed Test Data

10-20 wells will be kept in a closed test dataset. The committee will run the participants' submitted code on the closed test dataset for final scoring. Both the logs and the interpretations of the closed test dataset are kept secret during the competition, but they will be in the same format as the open test data. To ensure that your code will work for the final evaluation, make sure it can be run and make predictions on the open test data.

Lithology Key

The lithology key for the curve “FORCE_2020_LITHOFACIES_LITHOLOGY” is shown below:

30000: Sandstone

65030: Sandstone/Shale

65000: Shale

80000: Marl

74000: Dolomite

70000: Limestone

70032: Chalk

88000: Halite

86000: Anhydrite

99000: Tuff

90000: Coal

93000: Basement
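For convenience, the key can be written as a Python dictionary. This is a minimal sketch; the consecutive 0-11 index order below is an assumption, so use the mapping from the starter notebook to stay consistent with the provided penalty matrix.

```python
# Mapping from the FORCE_2020_LITHOFACIES_LITHOLOGY codes to names.
LITHOLOGY_NAMES = {
    30000: "Sandstone",
    65030: "Sandstone/Shale",
    65000: "Shale",
    80000: "Marl",
    74000: "Dolomite",
    70000: "Limestone",
    70032: "Chalk",
    88000: "Halite",
    86000: "Anhydrite",
    99000: "Tuff",
    90000: "Coal",
    93000: "Basement",
}

# A consecutive 0..11 index is convenient for most ML libraries. The order here
# is an assumption -- match it to the starter notebook / penalty matrix ordering.
LITHOLOGY_TO_INDEX = {code: i for i, code in enumerate(LITHOLOGY_NAMES)}
```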

Scoring Function

Instead of penalizing each wrong lithofacies prediction equally, we decided to use a custom-made penalty matrix derived from the averaged input of a representative sample of geoscientists. This allows petrophysically unreasonable predictions to be scored by their degree of "wrongness".

The scoring function used to evaluate predictions is given by

$$ S = -\frac{1}{N} \sum_{i=1}^{N} A_{\hat{y}_i,\, y_i} $$

where A is the penalty matrix (shown below), N is the number of samples, ŷ_i is the true lithology label and y_i is the predicted lithology label. See the starter notebook available on the "Data" tab for more details and example code for evaluating a prediction.
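A minimal sketch of the scoring function in Python, assuming the labels have already been mapped to the row/column indices of the penalty matrix and that the matrix file is named penalty_matrix.npy (the starter notebook remains the authoritative reference):

```python
import numpy as np

def competition_score(y_true_idx: np.ndarray, y_pred_idx: np.ndarray,
                      penalty_matrix: np.ndarray) -> float:
    """Negative mean penalty: a perfect prediction scores 0.0,
    and larger mistakes push the score further below zero."""
    return -np.mean(penalty_matrix[y_true_idx, y_pred_idx])

# Usage (file name assumed; indices must follow the starter notebook's ordering):
# A = np.load("penalty_matrix.npy")
# print(competition_score(true_idx, pred_idx, A))
```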

Penalty matrix

The penalty matrix is shown in the diagram below. It is recommended to take the penalty matrix into account either in model selection or in the optimization routine itself. 

(Figure: penalty matrix)

For convenience, the penalty matrix is provided as a numpy file on the Data tab.

Modeling Strategies to Consider

It may be worthwhile to consider designing an algorithm that mimics the work of a petrophysicist, although this is entirely optional and may not even lead to a higher score. The log database you have been given has been cleaned up as much as we could, but it is not perfect. It is therefore important to first sort out which proportion of the data is really useful and which may be error prone. Examples of erroneous data are the exact same log value repeated over many consecutive samples, or values outside physical bounds (spikes, recording errors). After checking the data quality, a petrophysicist would check the quality of the borehole. Washouts often severely affect log measurements, so it is important to identify bad hole sections in order to reduce the weight of those log values during training. Developing a bad hole flag should be of interest to everyone.
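As a rough sketch of such checks, assuming a per-well pandas DataFrame with GR, RHOB, CALI and BS columns (all thresholds are illustrative, not part of the challenge):

```python
import pandas as pd

def qc_flags(df: pd.DataFrame, n_repeat: int = 20) -> pd.DataFrame:
    out = df.copy()

    # Flag runs where GR repeats the exact same value over many consecutive
    # samples -- often a stuck tool or a fill value rather than real geology.
    same_as_prev = out["GR"].diff().eq(0)
    run_id = (~same_as_prev).cumsum()
    run_len = same_as_prev.groupby(run_id).transform("size")
    out["flag_repeated_gr"] = same_as_prev & (run_len >= n_repeat)

    # Flag physically implausible bulk density readings (illustrative bounds).
    out["flag_bad_rhob"] = out["RHOB"].notna() & ~out["RHOB"].between(1.0, 3.2)

    # Crude bad-hole flag: caliper much larger than bit size suggests washout.
    if {"CALI", "BS"}.issubset(out.columns):
        out["flag_washout"] = (out["CALI"] - out["BS"]) > 2.0  # assumed threshold, inches
    return out
```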

Sample weighting 

We have provided you with an interpretation confidence score that the well log interpreters assigned, mostly qualitatively (1: high confidence that the interpreted lithology is present, 3: low confidence). This curve is provided for the training data only, and can be used during training to filter data, assign weights to samples, or in model selection.
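One simple way to turn the confidence curve into per-sample training weights is sketched below; the weight values are illustrative, and most scikit-learn-style estimators accept a sample_weight argument.

```python
import pandas as pd

# Illustrative weights: down-weight medium- and low-confidence interpretations.
CONFIDENCE_WEIGHT = {1: 1.0, 2: 0.75, 3: 0.5}

def confidence_weights(train_df: pd.DataFrame) -> pd.Series:
    """Map the interpreter confidence column to per-sample training weights."""
    conf = train_df["FORCE_2020_LITHOFACIES_CONFIDENCE"]
    return conf.map(CONFIDENCE_WEIGHT).fillna(1.0)

# Usage:
# model.fit(X_train, y_train, sample_weight=confidence_weights(train_df))
```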

Stratigraphy

It has been shown that using the stratigraphic interval as a feature in the ML framework typically leads to better results, which is of course not surprising in itself. We provide you with the public NPD lithostratigraphy for the wells that we are releasing, and lithostratigraphy data will also be available for the blind wells that are not going to be released.

In order for machine-learned lithofacies interpretation from well logs to be useful, it would be advantageous if a high score could be achieved without the lithostratigraphy as a feature, or better yet with the lithostratigraphy itself predicted from the well logs.
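If you do use the lithostratigraphy, a minimal way to feed it to a model is one-hot encoding. This is a sketch; the column name "GROUP" is an assumption, so use whatever name the provided data actually carries.

```python
import pandas as pd

def add_strat_features(df: pd.DataFrame, strat_col: str = "GROUP") -> pd.DataFrame:
    """One-hot encode the lithostratigraphic interval so it can be used
    alongside the numeric log curves."""
    dummies = pd.get_dummies(df[strat_col], prefix=strat_col, dummy_na=True)
    return pd.concat([df, dummies], axis=1)

# Remember to align train and test to the same dummy columns, e.g.
# test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
```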

Augmentation

Petrophysicists base their interpretation on the absolute values of the log curves as well as, importantly, on the pattern of the log curves over intervals ranging from a few samples to tens of meters. Clever augmentation of the log curves and sequence-tolerant machine learning frameworks (recurrent nets etc.) could potentially improve lithofacies predictions.
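Below is a sketch of simple window-based features that expose those local patterns to a non-sequential model. The column names WELL and DEPT follow the dataset description above, and the window length is illustrative.

```python
import pandas as pd

def add_rolling_features(df: pd.DataFrame, logs=("GR", "RHOB", "NPHI"),
                         window: int = 15) -> pd.DataFrame:
    """Per-well rolling means and sample-to-sample gradients, so the model
    sees local log patterns rather than single-sample values only."""
    out = df.sort_values(["WELL", "DEPT"]).copy()
    grouped = out.groupby("WELL")
    for log in logs:
        if log not in out.columns:
            continue
        out[f"{log}_roll_mean"] = grouped[log].transform(
            lambda s: s.rolling(window, center=True, min_periods=1).mean())
        out[f"{log}_grad"] = grouped[log].transform(lambda s: s.diff())
    return out
```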

Model Ensembles

One of the distinguishing features of this dataset is the presence of features (well logs) that will not always be available. Finding a way to incorporate the information from rarely recorded well logs is a key challenge. One way of tackling this problem is to build sub-models that are trained to infer lithology under different cases of log availability.
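A rough sketch of that idea, dispatching each sample to the richest sub-model whose input logs are present; the feature sets, model choice and ordering are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Ordered from richest to simplest feature subset (illustrative).
FEATURE_SETS = {
    "full":    ["GR", "RHOB", "NPHI", "DTC"],
    "gr_only": ["GR"],
}

def train_submodels(train_df: pd.DataFrame, y: np.ndarray) -> dict:
    """Train one model per feature subset on the rows where that subset is complete."""
    models = {}
    for name, cols in FEATURE_SETS.items():
        mask = train_df[cols].notna().all(axis=1).to_numpy()
        model = RandomForestClassifier(n_estimators=200, n_jobs=-1)
        model.fit(train_df.loc[mask, cols], y[mask])
        models[name] = model
    return models

def predict_dispatch(models: dict, df: pd.DataFrame) -> np.ndarray:
    """For each sample, use the first (richest) sub-model whose inputs are all present."""
    preds = np.full(len(df), -1, dtype=int)
    remaining = np.ones(len(df), dtype=bool)
    for name, cols in FEATURE_SETS.items():
        usable = remaining & df[cols].notna().all(axis=1).to_numpy()
        if usable.any():
            preds[usable] = models[name].predict(df.loc[usable, cols])
            remaining &= ~usable
    return preds  # samples with even GR missing keep the placeholder -1
```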

Lithofacies Boundaries

The exact centimetre-scale picking of lithofacies boundaries is often subjective and somewhat imprecise. For that reason, one could consider giving the samples immediately next to the lithofacies boundaries a somewhat lower weight during training.
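A sketch of such down-weighting follows; the column names are taken from the dataset description, while the window and the weight value are illustrative.

```python
import numpy as np
import pandas as pd

def boundary_weights(train_df: pd.DataFrame, n_samples: int = 3,
                     low_weight: float = 0.5) -> pd.Series:
    """Give samples within n_samples of a lithology change a lower training weight,
    since the exact boundary pick is often imprecise."""
    weights = pd.Series(1.0, index=train_df.index)
    for _, well in train_df.groupby("WELL"):
        labels = well["FORCE_2020_LITHOFACIES_LITHOLOGY"]
        change = labels.ne(labels.shift())
        change.iloc[0] = False  # the first sample of a well is not a boundary
        near_boundary = (
            change.astype(float)
            .rolling(2 * n_samples + 1, center=True, min_periods=1)
            .max()
            .astype(bool)
        )
        weights.loc[well.index] = np.where(near_boundary, low_weight, 1.0)
    return weights

# These can be multiplied with the confidence-based weights before fitting.
```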

Final Words

While we have made an effort to provide a fair dataset and assessment of the scores, we are only human and prone to mistakes and errors. We accept no liability for our mistakes.

Whatever the outcome of the competition, the organizers' decision is final and recourse to the courts is not permitted.
