Exoplanet research has been a popular field of astronomy for a few decades now. By definition, exoplanets are planets outside our solar system, either rogue (not bound to any star) or orbiting other stars. Detecting exoplanets, however, is quite complex, and one of the favored ways of finding them is the transit method.
Using machine learning algorithms, my group and I were able to predict which stars host exoplanets and which don’t.
The transit method relies on measuring the flux, or brightness, of a star over time. In our project, these brightness measurements were turned into images (light curves) for a machine learning model to classify. If the brightness dips for a while and then rises again, it means that something passed in front of the star and blocked part of its light. That event is a transit.
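To make the idea of a "dip" concrete, here is a minimal sketch that scans a list of brightness values and reports where they fall below the star's usual level. It is only an illustration, not the model from our project: the function name `find_dips`, the 1% depth threshold, and the toy flux values are all assumptions made up for this example.

```python
from statistics import median

def find_dips(flux, depth=0.01):
    """Return (start, end) index pairs where flux drops below the baseline.

    The baseline is the median brightness; `depth` is an assumed
    fractional drop (1% here) that counts as a dip.
    """
    baseline = median(flux)
    threshold = baseline * (1.0 - depth)
    dips = []
    start = None
    for i, value in enumerate(flux):
        if value < threshold and start is None:
            start = i                      # brightness just dropped
        elif value >= threshold and start is not None:
            dips.append((start, i - 1))    # brightness recovered
            start = None
    if start is not None:                  # series ended while still dimmed
        dips.append((start, len(flux) - 1))
    return dips

# Toy, made-up flux values: two short dips of about 2%.
flux = [1.00, 1.00, 0.98, 0.98, 1.00, 1.00, 1.00, 0.98, 0.98, 1.00]
dips = find_dips(flux)
print(dips)
```

Real measurements are noisy, so a fixed threshold like this would flag plenty of false dips, which is part of why a trained model is useful.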
To understand the role of data, we first need to understand why a machine learning model needs data to learn. In our project, we used a database of images from NASA, which consisted of light curves of stars with and without exoplanets. There were over 3000 of these images, which equates to…a lot of data!
A machine learning model is fed all of this data so that it can learn the patterns in these images. The data used for this is known as “training data”. For our training data, we labeled each image as showing the light curve of either an exoplanet transit or a non-exoplanet. Once the model was trained on these images, it could distinguish more accurately whether a new image showed the light curve of an exoplanet transit or not.
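As a rough sketch of what labeled data can look like, the snippet below pairs each image with a label (1 for an exoplanet transit, 0 for a non-exoplanet) and sets part of the set aside so the model can later be checked on images it has never seen. The file names, the 80/20 split, and the helper `split_labeled` are hypothetical and chosen only for illustration; they are not the exact setup we used.

```python
import random

def split_labeled(examples, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into (train, held_out)."""
    shuffled = examples[:]                 # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_held_out = int(len(shuffled) * test_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]

# Hypothetical file names and labels: 1 = exoplanet transit, 0 = non-exoplanet.
examples = [(f"curve_{i:04d}.png", 1 if i % 10 == 0 else 0) for i in range(100)]

train, held_out = split_labeled(examples)
print(len(train), len(held_out))
```

Keeping the held-out images away from training is what makes a later accuracy figure meaningful.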
New images like these are known as “testing data”. Testing data is simply previously unused data that is fed into the model to test its accuracy and make predictions. In our case, the testing data included unknown transits / flux patterns, so the model had to make an appropriate guess. In addition, the period of these flux dips is calculated. One period, in this case, is equal to 1184 pictures. This is valuable information to us because knowing how often the light dips lets us be more confident about whether a passing object is an exoplanet or not. If the dips occur at constant intervals, the likelihood that the object is an exoplanet increases.
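The "constant intervals" check can be sketched in a few lines: measure the gap between the start of each dip and the next, then see whether the gaps stay close to their average. The 1184 spacing is borrowed from the period mentioned above; the starting positions, the 5% tolerance, and the function names are assumptions for this example only.

```python
def dip_intervals(dip_starts):
    """Gaps between consecutive dip start positions."""
    return [later - earlier for earlier, later in zip(dip_starts, dip_starts[1:])]

def looks_periodic(dip_starts, tolerance=0.05):
    """True if every gap is within `tolerance` (as a fraction) of the mean gap."""
    intervals = dip_intervals(dip_starts)
    if len(intervals) < 2:
        return False                      # too few dips to judge regularity
    mean_gap = sum(intervals) / len(intervals)
    return all(abs(gap - mean_gap) <= tolerance * mean_gap for gap in intervals)

# Made-up dip positions: the first list repeats every 1184 pictures.
regular = [100, 1284, 2468, 3652]
irregular = [100, 900, 2468, 2600]

print(looks_periodic(regular))
print(looks_periodic(irregular))
```

Evenly spaced dips point to something orbiting the star, while scattered dips are more likely noise or a one-off event.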
There are issues with the model, however. One is that these light curves only apply to star systems with one star and one planet. Transits would look very different for stars with multiple planets, or for circumbinary planets. Furthermore, the data was imbalanced, with far more non-exoplanet data than exoplanet data. This skew led to a high accuracy only because the model tended to guess that a light curve represented a non-exoplanet. To fix this, we used the SMOTE method (Synthetic Minority Over-sampling Technique) to generate synthetic data. This produced data similar to the exoplanets’ light curves to close the gap, and the model’s accuracy increased. However, there was still a reliability issue with this model, because some of the data points varied in range.
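Two of these points are easier to see in code. The sketch below first shows how a "model" that always answers non-exoplanet still scores high on skewed data, and then shows the core idea behind SMOTE: creating a new minority example by picking a point between a real exoplanet example and one of its nearest exoplanet neighbors. This is a hand-written simplification and not the library implementation; the 95/5 split, the three-value "light curves", and the function `smote_like` are all made up for illustration.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def smote_like(minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling (illustrative only).

    Each synthetic point lies on the line between a random minority
    sample and one of its k nearest minority neighbors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbors by squared Euclidean distance, excluding base itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # how far to move from base toward the neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Assumed skew: 95 non-exoplanet labels (0) and 5 exoplanet labels (1).
labels = [0] * 95 + [1] * 5
always_negative = [0] * len(labels)
print(accuracy(always_negative, labels))  # high, yet no exoplanet is ever found

# Made-up exoplanet examples, each a tiny three-value "light curve".
minority = [
    [0.98, 0.97, 0.99],
    [0.97, 0.96, 0.99],
    [0.99, 0.97, 0.98],
    [0.98, 0.96, 0.98],
    [0.97, 0.97, 0.99],
]
new_points = smote_like(minority, n_new=90)
print(len(minority) + len(new_points))  # minority class now matches the 95 negatives
```

Because the synthetic points are blends of real ones, they can only fill in the space between existing exoplanet examples; they do not add genuinely new observations.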