
Project

As described in the "About" section, our project revolves around building a class recommendation program. Before designing the top-level program or writing code for specific tasks, our team had to collect data. We did so by asking multiple people and classes to take a Google Forms survey (https://docs.google.com/a/umich.edu/forms/d/e/1FAIpQLSdSkuVtOiTDbAKaI3cuNytgedlooNocNLwuQhUgMDW7xqdXNg/viewform). The survey asked each student for their class standing, major, degree specialization, the classes they had taken, how much they enjoyed each class, and the grade they received in it. Our goal was a minimum of 100 responses, and we currently have 117 valid responses. Once the data was acquired, we parsed it into separate files: class standing, major, specialization, the EECS classes taken across the data set, each student's rating of each class, and each student's grade in each class. The project is built on the idea that a user's inputs generate a vector, or signal, which is compared against the signals in the data set to output the information that best fits and completes the user's inputs.


The program takes an input matrix of student data made up of grades and/or ratings and information about the student (a number representing their class year, major, and specialization). Every row of the input matrix represents a student and every column represents a class. The matrix is filled with the students' ratings or grades; any class a student has not provided information about is filled with a zero, which marks a hole in the data. The input matrix is then completed using Singular Value Decomposition or column averaging as a predictor. The classes with the highest predicted values that the student has not already taken are output as a recommendation vector.
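As a rough sketch of that final step, the following Matlab snippet (with hypothetical variable names R, Rhat, s, and nRec) picks the top recommendations for one student from a completed matrix; it illustrates the idea rather than our exact implementation.

% R    - original student-by-class matrix; zeros mark classes not taken (holes)
% Rhat - the same matrix after completion (SVD or column averages)
% s    - row index of the student of interest; nRec - number of recommendations
notTaken = (R(s, :) == 0);             % holes correspond to classes not yet taken
scores = Rhat(s, :);
scores(~notTaken) = -Inf;              % never recommend a class already taken
[~, order] = sort(scores, 'descend');  % highest predicted values first
recommendation = order(1:nRec);        % column indices of the recommended classes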

CONTACT

For any and all inquiries about our project, you may contact one of us at our respective email addresses:


Hannah Noble:

nobleh@umich.edu


Andrew Pearl:

andpearl@umich.edu


Jason VandenBerg:

jvandey@umich.edu


11/20/2016 Progress Report

Thus far our team has collected the necessary data and we have begun to parse through it and organize the data into appropriate vectors/signals. As seen on the right, we have acquired and analyzed each student's rating of the EECS classes submitted thus far.

 

The top plot on the right is simply a plot of every student's rating for the classes. If a student did not list one of the classes that other students listed, that student's rating for the class was recorded as a zero. This zero-padding technique will also be used with the information the user enters through the GUI: for example, if the list of classes from the data set includes EECS 301 but the user does not list EECS 301 as a class s/he has taken, the program will append a rating of zero to the user's input signal for EECS 301.
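A minimal sketch of that padding step, assuming hypothetical names classList (the master list of classes from the data set, as a cell array of strings), userClasses, and userRatings (the user's GUI entries):

userSignal = zeros(1, numel(classList));            % one slot per class in the data set
[found, where] = ismember(userClasses, classList);  % match the user's classes to the master list
userSignal(where(found)) = userRatings(found);      % unmatched classes keep a rating of zero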


The middle plot is a plot of the FFT of the ratings matrix. Looking at the FFT, not much information can be acquired for each student about his/her class ratings.


The bottom plot is a plot of the average rating for each class. This was acquired by adding together all ratings in each column and dividing the resulting sum by the number of nonzero ratings in each respective column. The code that produced the three plots on the right is shown on the left.
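Since the script itself appears only as an image, here is a hedged sketch of how the three plots could be produced, assuming ratings is the student-by-class matrix with zeros for missing entries (the actual code may differ):

subplot(3,1,1);
plot(ratings.');                                    % every student's rating for each class
title('Student ratings by class');

subplot(3,1,2);
plot(abs(fft(ratings, [], 2)).');                   % FFT of each student's rating signal
title('FFT of the ratings matrix');

subplot(3,1,3);
avgRating = sum(ratings, 1) ./ max(sum(ratings ~= 0, 1), 1);  % average over nonzero entries only
stem(avgRating);
title('Average rating per class');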

 

We have encountered some difficulties with loading our data, and there are a few complications with the data in general. When the data is downloaded as a .csv file from the Google Forms website, the classes, ratings, and grades (submitted in the format <class,rating,grade>) are not each in their own columns, and parsing that data has been more difficult than expected. The data itself has also proven more complicated than expected: due to human error, some respondents entered data in the incorrect format or submitted information for a class that is not offered through EECS or does not exist (e.g. ENGR101 or EECS 201). These discrepancies will have to be dealt with as soon as possible to ensure the most appropriate data set is being used.
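As an illustration of the kind of validation needed, here is a sketch of how a single <class,rating,grade> entry could be checked while parsing; entry and validClasses are hypothetical names, and this is not our final parser.

parts = strsplit(strtrim(entry), ',');             % e.g. 'EECS 351,8,A-' -> {'EECS 351','8','A-'}
if numel(parts) == 3
    name   = upper(strrep(parts{1}, ' ', ''));     % normalize 'eecs 351' -> 'EECS351'
    rating = str2double(parts{2});
    grade  = strtrim(parts{3});
    isValid = any(strcmp(name, validClasses)) && ~isnan(rating);
else
    isValid = false;                               % wrong format: skip this entry
end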


Over the next three weeks, there is much to be done: 

  • The most important and immediate task is segregating all the data (e.g. grades, class standing) and representing all the data as the same type to do appropriate comparisons in Matlab.

  • The second task is creating a program to compare an inputted signal with the signals we have created from our data set.

  • The third task we must accomplish in the next three weeks is creating a user-friendly GUI that integrates the separate components of our project together.


One important thing we have all learned on this team is how quickly machine learning can develop complexities. Our project is a very low-level form of machine learning that relies simply on data comparison, not necessarily on any extrapolation or interpolation. In researching machine learning, statistics, and data manipulation, we have seen firsthand that machine learning is no easy task, regardless of the task at hand.

ABOUT

Welcome to the website of EECS 351 team Class Success! This website hosts all information that one would need to know about our project. We are a team of three students who are currently enrolled in EECS 351: Hannah Noble, Andrew Pearl, and Jason VandenBerg. Our team was tasked with coming up with an original project idea that involves Digital Signal Processing (DSP) techniques we have learned in class. Although there are a myriad of applications of DSP, our team decided to link our project to machine learning. Our project idea is a recommendation system. Our program takes a student's class history and recommends what classes they would be successful in by comparing their past performance and preferences to other students' past performance and preferences.


EECS 351 Team CLASS SUCCESS

Team members: Hannah Noble, Andrew Pearl, Jason VandenBerg

Data Collection and Calculation

SVD Cases

  • Basic Cases

    • Ratings - Matrix of only ratings information

    • Grades - Matrix of only grades information

    • Combined - Matrix of both ratings and grades information (twice as wide)

  • Cases including Student Information

    • All student information included

    • Only class year included

    • Only major included

    • Only specialization included

  • Combinations of two pieces of student information

  • Cases with a cut-off threshold of students

    • Matrix of classes taken by at least n students

      • 100 Students

      • 50 Students

      • 40 Students

      • 20 Students

  • Combinations of these different cases

 

k-Means Cases

  • Clustered data with k-means

    • Used LOOCV to find the best number of clusters k for each case

    • Predicted using the mean of the column in the cluster

  • Repeated on representative cases

    • All basic cases

    • Cases with all student information included

 

Control Cases

  • Predicting with simply the average of the columns

    • Complete for all three basic cases

  • Predicting a randomly generated matrix with average of columns

    • Complete for matrices the size of all three basic cases

 

Representing A+’s

  • All cases were run for A+ = 4.0 and A+ = 4.3

Click here to see a complete list of the error associated with each matrix we tested in our algorithm

Summary of Cases

Overall, the absolute error associated with using SVD to predict grades/ratings was very high. We ran SVD matrix completion on different variations of our data, with all data normalized so that the maximum value was one. The best case using SVD completion was achieved by combining the grades and ratings data into a matrix of classes taken by over 100 students and including the students' class year.
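For concreteness, one common way to perform rank-k SVD matrix completion is iterative imputation with a truncated SVD. The sketch below uses hypothetical names (X for the normalized matrix with zeros marking holes, k for the number of singular values kept) and is not necessarily our exact implementation.

holes = (X == 0);
Xhat  = X;
for iter = 1:50                        % iterate until the imputed values settle
    [U, S, V] = svd(Xhat, 'econ');
    S(k+1:end, k+1:end) = 0;           % keep only the k largest singular values
    Xlow = U * S * V';                 % rank-k approximation of the current estimate
    Xhat(holes) = Xlow(holes);         % refill only the holes, keep the known entries
end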

 

All error was calculated with leave-one-out cross validation (LOO-CV). Each known data point was left out one at a time, the resulting empty space was predicted using whichever method we were testing (SVD or averages), and the absolute error was calculated for that point. The average absolute error was then found for each prediction method on each data set.
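A minimal sketch of that procedure, where predictFn is a hypothetical function handle for whichever completion method is under test:

[rows, cols] = find(X ~= 0);            % only known entries are left out
errs = zeros(numel(rows), 1);
for i = 1:numel(rows)
    Xcv = X;
    Xcv(rows(i), cols(i)) = 0;          % leave this one entry out
    Xpred = predictFn(Xcv);             % fill the hole with the method being tested
    errs(i) = abs(Xpred(rows(i), cols(i)) - X(rows(i), cols(i)));
end
meanAbsErr = mean(errs);                % average absolute error for this method and data set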

 

To get an idea of what that error means for the SVD prediction method, we compared it to simpler methods: predicting with the average of the column instead of SVD, and completing LOO-CV on a randomly generated matrix of the same size. Taking the average of every column worked better than every SVD prediction we performed, and the randomly generated matrix produced lower error than every SVD prediction except those on matrices of classes taken by over a hundred students with some amount of student information included.
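The column-average control is simple enough to sketch in a few lines (hypothetical names again):

colAvg = sum(X, 1) ./ max(sum(X ~= 0, 1), 1);  % per-class mean over the known (nonzero) entries
Xhat   = X;
holes  = (X == 0);
avgRep = repmat(colAvg, size(X, 1), 1);
Xhat(holes) = avgRep(holes);                   % every hole gets its class average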

 

To try to improve the error, we clustered the data using k-means and made predictions by taking the mean of each column within a cluster. In every case, the best number of clusters was one, which produces the same error as simply taking the average of every column. This may have happened because there are so many holes in the data set; clustering being ineffective could also suggest that students' data is not highly correlated with other students' data.
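A sketch of that clustered prediction step, assuming the same zero-marked holes and hypothetical variable names:

idx  = kmeans(X, k);                               % cluster the student rows into k groups
Xhat = X;
for c = 1:k
    members = (idx == c);
    colAvg  = sum(X(members, :), 1) ./ max(sum(X(members, :) ~= 0, 1), 1);
    block   = X(members, :);
    holes   = (block == 0);
    avgRep  = repmat(colAvg, sum(members), 1);
    block(holes) = avgRep(holes);                  % predict holes with the cluster's class averages
    Xhat(members, :) = block;
end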

Summary of Analysis

Data Analysis

Students who don’t share classes:

There was one student who had only taken one class, and no one else had taken that class. We originally thought we couldn’t provide a recommendation for him since he was linearly independent of every other student.

 

In matrices that don’t include any student information, we weren’t able to provide recommendations for this student (student 84), and we weren’t able to recommend this class (595) to anyone else.

 

When we included student information, student 84 was no longer linearly independent, and we were able to provide him with a class recommendation.


Classes that are recommended frequently:

When a student has not taken one of the more popular classes (e.g. requirements like 215, 216, 280, 230, and 301), they are likely to be recommended one of those classes. However, no individual one of these classes is recommended frequently, since most people have already taken them.

 

We don’t see many of the 400- and 500-level classes being recommended. Not many people have taken them, so there is less chance that a student overlaps with another student who has taken one of these less popular classes.

Analysis of Students and Recommendations

If the data set had been larger, so that the error associated with too small a data set was diminished, we would expect the following sources of error in our survey to be more significant:


Sampling bias:

Most of the data we received from students was in the range of B- to A+. We received very few C+’s or worse; in fact, we only got one non-passing grade.

 

Another form of sampling bias is Non-Response Bias. We believe this is what caused our lack of responses from students with poor grades; no student wants to admit that s/he has bad grades.

 

Our last form of bias is Selection Bias. We sent our survey to the University of Michigan’s Eta Kappa Nu chapter, the EECS engineering honor society, which is typically filled with students who have higher GPAs. Hence, our data was skewed.


Ratings being subjective:

Ratings defined by an individual are subjective: unlike grades, one person’s five isn’t necessarily equal to another’s five. Because our error was already large, this doesn’t affect the current project much, but if we were to take this forward we would want a way to standardize the ratings.

Systematic Error

Larger sample size:

We got 117 usable responses covering a total of 57 classes. This means our basic grades or ratings matrix was 117 x 57, and our basic combined matrix was 117 x 114. There were a handful of classes that many people had taken (mainly core requirements), but many of the elective courses were taken by only a few people. Since no one has taken a large percentage of the EECS classes, we had a large number of holes (zeros) in the data.

 

A larger sample size would give us more students rather than more classes, since we have already included almost every EECS class. As more students are added without more classes being added, there is more overlap between students' class histories.

 

Collect Data more Efficiently:

  • Rather than have students manually type their responses, we could provide a drop-down menu for each class

  • Provide a more objective way of rating a class as compared to a subjective rating of 1 to 10

    • Ask students to choose between predetermined evaluations, e.g. selecting between the options of Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree for a statement

  • Develop methods for collecting more varying student information

    • Perform random polling within the EECS building, asking bystanders to participate in the survey as they pass by

    • Ask anyone who goes through the advising office to complete the survey before leaving the office

Improvements

Topics from Class:

  • Bases

  • Linear independence and dependence

  • Orthogonal bases/vectors

  • Rank of a matrix/vector

  • N-dimensional vector spaces

    • Each student in our data set is a single signal in a 57-dimensional (the number of classes) vector space, or a 114-dimensional vector space when both grades and ratings data are included for each class

      • Each dimension represents a class
         

Topics from Outside Class:

  • Singular Value Decomposition

    • Bases

    • Rank of matrices/vectors

    • Linear dependence/independence

  • k-Means Clustering

    • Uses squared Euclidean distance

      • Determines the distance between vectors in n dimensions

      • Finds distance between student vectors in the 57 or 114 dimensional vector space

      • It would be interesting to see whether clustering were more effective using Chebyshev Distance, Mahalanobis Distance, or another way to measure the distance between vectors; Matlab's k-means clustering defaults to squared Euclidean distance (see the sketch after this list).
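As a sketch of what exploring other metrics could look like (not something we implemented), Matlab's kmeans accepts only a handful of distance options, so Chebyshev or Mahalanobis comparisons would have to go through pdist or a custom clustering loop:

dSqEuc = pdist(X, 'euclidean').^2;   % squared Euclidean, matching the kmeans default behaviour
dCheb  = pdist(X, 'chebychev');      % Chebyshev (L-infinity) distance between student vectors
dMahal = pdist(X, 'mahalanobis');    % Mahalanobis distance; may fail if the covariance is singular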

Summary of Topics

Thresholds:

We predicted that a threshold somewhere between the highest and lowest would produce the best results. Smaller matrices are not filled in as well, since a single hole is more significant in a small matrix than in a large one, while matrices with fewer holes are filled in better by SVD. At high thresholds the matrix is smaller but has fewer holes; at low thresholds the matrix is larger but has more holes. Because of this trade-off between matrix size and number of holes, we expected the best threshold to fall somewhere in the middle.

 

We actually found that the largest threshold produced the smallest error.

 

This may have happened because there are so many holes in the data. Because of this, the effect of smaller matrices having significantly fewer holes heavily outweighed the effect of smaller matrices being less accurately filled in. This resulted in the largest threshold having the least error.

Student Information:

We predicted that the more student information was included, the better the SVD prediction would be. We thought the data would be more varied, so a basis representing it would need more linearly independent rows, giving a better prediction. In addition, since every student had to answer the student-information questions, including student information adds up to three new columns to the data set with no zeros/holes; the higher the percentage of the matrix that is filled in, the better the prediction.

 

We found that including three pieces of student information wasn’t significantly better than including one or two pieces, but it was significantly better than not including any student information. In some cases, including all of the student information wasn’t better than including only some of it.

 

It would be possible to comment on why different student information was more useful if we had a much larger data set and could more clearly see trends.

Grades, Ratings, Combined Vectors:

We predicted that the combined vector of ratings and grades would yield the best estimate of the three possible vectors, since it contains twice as much data, and that grades and ratings would predict roughly equally well, since they are the same size and contain the same amount of data.

 

We found that the combined vector did yield the best estimate as compared to just ratings and just grades. The Ratings matrix was a better predictor than the Grades matrix. We saw these trends continue for different variations on the original matrices.

      • Error for the combined vector was .387.

      • The errors for the Grades and Ratings vectors were .473 and .390, respectively.

With more data present in the combined vector than in either the grades or ratings vectors, matrix completion has more information to work with. It could also be that a basis representing each row of the combined data must be of higher rank than one for the grades or ratings vector alone; more linearly independent rows are needed in that basis to represent the data.

This is supported by the best k values used in the SVD matrix completion:

      • For Combined, the best k value was 4

      • For Ratings and Grades, the best k value was 2

This shows that Combined was best represented by a higher-rank (less heavily compressed) version of the original data. We saw the trend throughout our data that matrices best represented by higher k values had lower error.

 

Looking into why Ratings was a better predictor than Grades: there are 10 possible ratings, and our data contains every rating from 1 to 10. Most grades, however, were between a B and an A, and all but one data point fell between a C and an A+, so only 7 or 8 possible grades actually appear and there is significantly less variance than in the ratings. This could have caused ratings to be a better predictor. When we normalized the grades, the lowest grade present was not a zero, as it was for ratings. Normalizing the grades data to the minimum and maximum in the data set could work better than dividing the GPA scale by the maximum possible GPA; it would spread the data out between zero and one instead of clumping it between .75 (B) and 1 (A or A+).
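A sketch of that alternative normalization, where G is a hypothetical grades matrix with zeros marking holes:

known = (G ~= 0);                                % zeros still mark holes, so exclude them
gMin  = min(G(known));
gMax  = max(G(known));
Gnorm = G;
Gnorm(known) = (G(known) - gMin) ./ (gMax - gMin);
% note: the lowest observed grade now maps to 0, which collides with the hole marker,
% so a small offset or a separate mask for holes would be needed in practice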

Grade Weightings:

We predicted that changing an A+ from a 4.0 grade point to a 4.3 grade point would reduce error by introducing slightly more variation into the data set.

 

We found that changing an A+ to a 4.3 grade point did reduce error, but not by much. However, it did push the matrix of classes taken by over 100 students, filled with just grades data and all student information, over the threshold of being better than the corresponding random matrix.

      • error_G_100_SI = .265 for A+ = 4.0

      • error_G_100_SI = .247 for A+ = 4.3

      • error_GR = .252 (corresponding random matrix)

 

As we expected, the error was reduced due to an introduction of more variation within the data. However, since a 4.3 is not far off from a 4.0, the variations within the data set after normalization did not affect the cross validation significantly.

Ground Truth:

We originally thought that creating a ground truth of averages instead of zeros would improve our results. Average padding and zero padding are both commonly used, and since averages are closer to our desired output than zeros, we expected average padding to improve predictions.

 

This thinking was incorrect. All types of padding resulted in the exact same output.

 

Since no new independent information was added to the matrix, the value of the padding didn’t matter.
