Weekly Report 3 - Feb 19-23 2018

23 Feb 2018

Amlaan

The main issue will be scraping the Yelp and Wunderground (weather) data. Cleaning the data will also require some clever tricks to avoid just dropping rows with null or invalid values.
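One way to avoid dropping rows outright is to fill bad values and flag them. The sketch below is only an illustration: the `temp_f` field, the sample rows, and the choice of median imputation are assumptions, not the cleaning method the team has settled on.

```python
import math
from statistics import median


def impute_missing(rows, numeric_fields):
    """Fill missing or unparseable numeric values with the column median.

    A "<field>_imputed" flag is added so later steps can tell real
    readings from filled-in ones.
    """
    def to_float(value):
        try:
            result = float(value)
        except (TypeError, ValueError):
            return None
        return None if math.isnan(result) else result

    cleaned = [dict(row) for row in rows]  # copy so the input is untouched
    for field in numeric_fields:
        parsed = [to_float(row.get(field)) for row in cleaned]
        valid = [v for v in parsed if v is not None]
        fill = median(valid) if valid else None
        for row, value in zip(cleaned, parsed):
            row[field + "_imputed"] = value is None
            row[field] = fill if value is None else value
    return cleaned


# Hypothetical weather rows; "temp_f" is an illustrative field name.
rows = [
    {"date": "2018-02-19", "temp_f": "31.0"},
    {"date": "2018-02-20", "temp_f": None},
    {"date": "2018-02-21", "temp_f": "N/A"},
    {"date": "2018-02-22", "temp_f": "35.0"},
]
cleaned = impute_missing(rows, ["temp_f"])
```

All four rows survive; the two bad readings are replaced by the median of the valid ones and marked as imputed.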

Improved the utility scripts that cache the sentiment dataset and preprocess the text (tokenization and tf-idf normalization) for scikit-learn machine learning models.
Set up oversampling techniques such as SMOTE to better balance the sentiment classes.
Set up training scripts for sentiment analysis.
Trained a rough sentiment embedding using a FastText model to improve recognition of negative sentiment.
Trained deep learning models in Keras for sentiment analysis:
- Attention LSTM
- Multiplicative LSTM
- Nested LSTM
- Neural Architecture Search RNN Cell
- Minimal RNN Cell
- CNN
- FastText
- MLP
Try to remove extreme positive bias from the following models:
- Logistic Regression
- Decision Trees
- Random Forest
Scraped weather data from the Wunderground API.
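For reference, the idea behind the SMOTE oversampling mentioned above can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between a real minority sample and one of its nearest minority neighbours. This is a simplified, hand-written illustration with made-up 2-D points, not the code used in the project; in practice a library implementation such as the one in imbalanced-learn would normally be applied to the tf-idf feature vectors.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified SMOTE-style sketch: pick a random minority sample, pick one
    of its k nearest minority neighbours, and interpolate between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")

    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up feature vectors standing in for negative-sentiment samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies rather than being exact duplicates.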

Scraped the data sets I wanted to gather.
Discussed more pipeline changes which seemed redundant after last week.
Tried out some visualization frameworks and libraries such as JS Charts and Highcharts, and learnt a little d3.js to get a sense of what it offers.

Clean the scraped data and integrate it with our current directory structure
Try to establish the connections between datasets that we discussed and see whether they are actually feasible. This will expose any holes in our ideas as well as mismatches between dataset schemas.
Discuss and implement at least one machine learning approach in addition to the other teammates' efforts.
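A quick feasibility check for a proposed connection between two datasets is to compare the values of the field they would be joined on. The sketch below assumes a shared `zip` field and uses invented rows; the real join keys are still to be decided.

```python
def key_coverage(left_rows, right_rows, key):
    """Report how well two datasets overlap on a candidate join key."""
    left_keys = {row.get(key) for row in left_rows if row.get(key) is not None}
    right_keys = {row.get(key) for row in right_rows if row.get(key) is not None}
    return {
        "shared": len(left_keys & right_keys),
        "left_only": sorted(left_keys - right_keys),
        "right_only": sorted(right_keys - left_keys),
    }


# Invented rows; "zip" is a hypothetical join field.
yelp_rows = [{"zip": "60601"}, {"zip": "60614"}, {"zip": "60657"}]
census_rows = [{"zip": "60601"}, {"zip": "60614"}, {"zip": "60622"}]
report = key_coverage(yelp_rows, census_rows, "zip")
```

Keys that appear on only one side show where a join would silently lose rows, which is the kind of schema mismatch worth catching before building the pipeline.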

None, other than the usual snags in scraping different sites, each with its own HTML structure and hierarchy.
Long tutorials on some simple concepts took up time; a better choice of tutorial would have let me finish sooner. No technical difficulties yet.

Updated project structure and core modules that will be built later
Looked at the Census.Gov and CityofChicago datasets and at work related to demographic information.
Tried some tutorials on data extraction.

Apply what I have learned about data extraction.
Work with the team on scraping the Yelp data as per the specifications provided.
Learn more on visualization.

Nothing unexpected; as noted above, integrating the data may need some extra effort.