451 project

Unraveling UFO Sightings

Ridhi Mehra, Shloka Mohanty, Mrugank Pednekar, Mitanshu Thakore (Group - 4)

1 Introduction

This project explores global UFO sightings since 1949, studying location, time, duration, and shape. We apply k-NN, decision trees, logistic regression, and SVMs to analyze correlations between these features and to predict them with trained classifiers.

2 Dataset

Source: https://www.kaggle.com/datasets/camnugent/ufo-sightings-around-the-world

The dataset consists of 66,516 rows and eight columns documenting worldwide UFO sightings since 1949. It includes dates, times, coordinates, shapes, and encounter durations. The “is day” feature (1 for daylight, 0 for nighttime) records whether a sighting occurred during the day or at night. The data was cleaned before analysis.

3 Implementing Machine Learning Algorithms

3.1 K-NN classification

• Studying the relation between Latitude, Longitude, and UFO Shapes

1. Because there are 28 different UFO shapes, we encoded them as integer labels from 0 to 27.
2. The dataset was split into Training, Validation, and Testing subsets.
3. Hyperparameter tuning: we ran a grid search to find the parameters that give the best performance:

metric='manhattan'
n_neighbors=7
weights='distance'

4. Model Assessment using accuracy metric for testing data:

Accuracy: 0.1362763915547025

5. We predicted the UFO shape for UW-Madison Campus.

clf.predict(np.array([[43.0766, 89.4125]]))
>>> circle
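The label encoding and the distance-weighted Manhattan vote described in the steps above can be illustrated without any library. This is a minimal sketch, not the project's code: the toy coordinates and shapes are made up, `encode_labels` and `knn_predict` are hypothetical helper names, and the small `eps` term is a simplifying assumption for handling zero distances.

```python
from collections import defaultdict


def encode_labels(labels):
    """Map each distinct label to an integer in sorted order (0, 1, 2, ...)."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}


def manhattan(a, b):
    """L1 distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=7, eps=1e-12):
    """Distance-weighted k-NN vote: closer neighbours count for more."""
    nearest = sorted(
        ((manhattan(point, query), label) for point, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / (dist + eps)  # eps avoids division by zero (assumption)
    return max(votes, key=votes.get)


# Hypothetical toy data: (latitude, longitude) pairs with made-up shapes.
toy_X = [(43.0, -89.4), (43.1, -89.5), (34.0, -118.2), (34.1, -118.3), (40.7, -74.0)]
toy_shapes = ["circle", "circle", "light", "light", "triangle"]

codes = encode_labels(toy_shapes)  # shape name -> integer label
toy_y = [codes[s] for s in toy_shapes]
decode = {idx: name for name, idx in codes.items()}

predicted = decode[knn_predict(toy_X, toy_y, (43.07, -89.41), k=3)]
print(predicted)
```

With `weights='distance'`, a neighbour twice as far away contributes half the vote, so a few very close sightings can outvote a larger number of distant ones.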

• Studying the relation between coordinates and whether the UFO was sighted during the day or night

A similar approach was followed here, except that no label encoding was needed.

1. The following parameters were found to give the best performance:

metric='euclidean'
n_neighbors=7
weights='uniform'

2. Model Assessment using accuracy metric for testing data.

Accuracy: 0.744721689059501

3. Prediction for UW-Madison Campus.

clf.predict(np.array([[43.0766, 89.4125]])) #0:night, 1:day
>>> array([0])

3.2 Decision Tree

Model 1: Predicting Day/Night using Latitude, Longitude, and Encounter Length

• Accuracy: 80.02%
• Prediction for UW-Madison campus gives “Night” for all durations.
• Decision trees perform well when classifying with few features.

[Figure: decision tree diagram. Root node: latitude <= 40.808, entropy = 0.745, samples = 4165, value = [3282, 883]. Its two child nodes: latitude <= 33.717 (entropy = 0.778, samples = 2671, value = [2056, 615]) and latitude <= 41.633 (entropy = 0.679, samples = 1494, value = [1226, 268]).]

Figure 1: Model 1: Max Depth 3
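The entropy values shown in the tree nodes can be checked directly from the class counts. The sketch below is illustrative only: it uses the counts reported for the root node and its two children in Figure 1, and `entropy` is a helper name introduced here. It also computes the share of the larger class, which is the accuracy a classifier would get on this sample by always predicting that class.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


# Class counts read from the root node and its two children in Figure 1.
root = [3282, 883]
children = [[2056, 615], [1226, 268]]

root_entropy = entropy(root)
weighted_child_entropy = sum(
    (sum(child) / sum(root)) * entropy(child) for child in children
)
information_gain = root_entropy - weighted_child_entropy
majority_share = max(root) / sum(root)  # accuracy of always predicting the larger class

print(f"root entropy     = {root_entropy:.3f}")
print(f"information gain = {information_gain:.4f}")
print(f"majority share   = {majority_share:.3f}")
```

The root entropy reproduces the 0.745 shown in the figure. The information gain of the first split is a small fraction of a bit, and the larger class already makes up roughly 79% of this sample, which gives useful context for the reported accuracies of about 80%.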

Model 2: Predicting UFO Shape using Latitude, Longitude, and Encounter Length (MAXDEPTH=10, entropy)

• Accuracy: 19.28%
• Prediction for UW-Madison campus UFO Shape gives “Light” for all durations.
• UFO shapes are far less predictable from these features than Day/Night. Further investigation may be required to improve performance.

[Figure: decision tree diagram. Root node: length_of_encounter_seconds <= 3150.0, entropy = 0.745, samples = 4165, value = [3282, 883]. Its two child nodes: latitude <= 40.872 (entropy = 0.758, samples = 3882, value = [3032, 850]) and latitude <= 48.496 (entropy = 0.52, samples = 283, value = [250, 33]).]

Figure 2: Model 2: Max Depth 3

Model 3: Predicting Day/Night using Latitude and Longitude (MAXDEPTH=10, entropy)

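• Accuracy: 80.20%
• Prediction for UW-Madison campus gives “Night”.
• The accuracy is essentially unchanged from Model 1 (80.02%) after dropping the encounter length, which suggests that the encounter duration has little correlation with whether a UFO appears during Day/Night.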

[Figure: decision tree diagram. Root node: length_of_encounter_seconds <= 210.0, entropy = 3.791, samples = 4165, value = [117, 82, 101, 403, 10, 11, 81, 1, 62, 253, 34, 305, 79, 141, 867, 307, 209, 67, 278, 27, 437, 293]. Its two child nodes: length_of_encounter_seconds <= 5.5 (entropy = 3.803, samples = 2189) and latitude <= 35.685 (entropy = 3.734, samples = 1976).]

Figure 3: Model 3: Max Depth 3
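For the multi-class shape problem, two simple baselines help put the reported accuracies (13.63% for k-NN, 19.28% for the decision tree) in context: guessing uniformly at random and always predicting the most common class. The sketch below computes both from the 22 class counts at the root node of the tree in Figure 3. It is an illustration under assumptions: these counts describe the sample used to fit that tree, the test-set distribution may differ, and the mapping from count position to shape name is not given in the source.

```python
from math import log2

# Class counts at the root node of the tree in Figure 3 (22 classes, 4165 samples).
root_counts = [117, 82, 101, 403, 10, 11, 81, 1, 62, 253, 34,
               305, 79, 141, 867, 307, 209, 67, 278, 27, 437, 293]

total = sum(root_counts)
n_classes = len(root_counts)

uniform_guess = 1 / n_classes              # accuracy of guessing uniformly at random
majority_share = max(root_counts) / total  # accuracy of always predicting the largest class
root_entropy = -sum((c / total) * log2(c / total) for c in root_counts if c > 0)
max_entropy = log2(n_classes)              # entropy if all classes were equally common

print(f"classes          = {n_classes}")
print(f"uniform guess    = {uniform_guess:.3f}")
print(f"majority share   = {majority_share:.3f}")
print(f"root entropy     = {root_entropy:.3f} (max {max_entropy:.3f})")
```

The root entropy matches the 3.791 shown in the figure. On this sample the most common class alone accounts for about 21% of sightings, so a shape classifier needs to exceed that figure, not just the uniform-guess rate, to show that location and duration carry information about shape.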

3.3 Logistic Regression

1. Studying the Relationship between Length of UFO Encounter and Time of Day (Day or Night)

• Divided data into Training, Validation, and Testing.
• Training the Logistic Regression Model.
• Intercept and slope of the logistic equation:

intercept = -1.40938652
slope = -3.32652752e-08

• Accuracy:

0.8042693926638604

Our model outputs are shown in Figures 4a and 4b.

(a) Enlarged Graph (0-5000 seconds)

(b) Graph for all Data Points

Figure 4: Comparison of Graphs
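To see how flat the fitted curve is, the reported coefficients can be plugged into the logistic function. This is a minimal sketch, not the project's code: it reads the slope's exponent as e-08 (an assumption, consistent with the slope being described as almost 0), and the encounter lengths evaluated are arbitrary illustrative values.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + exp(-z))


intercept = -1.40938652
slope = -3.32652752e-08  # exponent sign is an assumption (read as e-08)

# Arbitrary illustrative encounter lengths, in seconds.
lengths = (0, 60, 5000, 1_000_000)
probs = {s: sigmoid(intercept + slope * s) for s in lengths}

for seconds, p_day in probs.items():
    print(f"{seconds:>9} s -> P(day) = {p_day:.4f}")
```

Under these coefficients the estimated probability of a daytime sighting stays near 0.196 across the whole 0 to 5000 second range and remains below 0.5 even at one million seconds, so a 0.5 threshold labels every sighting as night regardless of its duration.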

Analyzing the two graphs and the slope of the logistic curve, w, which is very close to 0, the log odds barely depend on x. Equivalently, the estimated P(y=1 | x) barely varies with x. Hence, day/night does not appear to be related to the length of encounter.

3.4 Support Vector Machine

1. Studying the Relation Between Latitude, Longitude, and Day/Night Prediction

• The dataset was split into a training set (80% of the data), a validation set (10%), and a test set (10%).
• An SVM was trained with an RBF kernel.
• Model performance metrics:

Accuracy: 0.804
Root Mean Squared Error (RMSE): 0.885

• Decision boundary plots

(a) SVM with Decision Boundary (RBF Kernel) for Training Data

(b) SVM with Decision Boundary (RBF Kernel) for Testing Data

Figure 5: Support Vector Machine (SVM) Results

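The RMSE of 0.885 means that predictions deviate from the true labels by 0.885 units in the root-mean-square sense. The SVM accuracy of 0.804 means it correctly predicts 80.41% of the test samples. The graphs suggest that day/night is not solely location-dependent.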
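For a binary classifier, RMSE is fully determined by the error rate once the label encoding is fixed, so the two reported metrics can be checked against each other. The source does not state how the labels were encoded when the RMSE was computed. The sketch below is a consistency check under two assumed encodings, not a description of the project's code.

```python
from math import sqrt

accuracy = 0.804
error_rate = 1 - accuracy

# Labels in {0, 1}: every misclassification has squared error 1.
rmse_01 = sqrt(error_rate)

# Labels in {-1, +1}: every misclassification has squared error (2)^2 = 4.
rmse_pm1 = sqrt(4 * error_rate)

print(f"RMSE with 0/1 labels   = {rmse_01:.3f}")
print(f"RMSE with -1/+1 labels = {rmse_pm1:.3f}")
```

The reported RMSE of 0.885 agrees with the -1/+1 encoding, while a 0/1 encoding would give about 0.443 at the same accuracy. Either way, RMSE carries no information beyond accuracy for a two-class problem.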

4 Conclusion

Four machine learning methods were assessed. k-NN and decision trees had lower accuracy, which we attribute to the skewed categorical (string) labels, making them less suitable. Logistic regression performed best, handling floating-point features well and achieving higher accuracy. However, it struggled to identify associations among features. This underscores the importance of algorithm selection in predictive modelling, especially with intricate feature relationships and diverse data distributions.

Contributions

Member            Proposal  Coding  Presentation  Report
Ridhi Mehra       1         1       1             1
Shloka Mohanty
Mrugank Pednekar
Mitanshu Thakore
