artificial intelligence - Smarter than an Eighth Grader? Kaggle AI Challenge in R
I am working on the Allen AI Science Challenge on Kaggle.
The idea behind the challenge is to train a model using the training data provided (a set of eighth-grade-level science questions, each with four answer options, one of which is correct, plus the correct answer itself) along with additional knowledge sources (Wikipedia, science textbooks, etc.) so that it can answer science questions as well as an (average?) eighth grader can.
I'm thinking of taking a first crack at the problem in R (I'm proficient in R and C++, and I don't think C++ would be a useful language to solve this problem in). After exploring the Kaggle forums, I decided to use the topicmodels (tm), RWeka and Latent Dirichlet Allocation (LDA) packages.
My current approach is to build a text predictor of sorts which, on reading the question posed, outputs a string of text. I would then compute the cosine similarity between this output text and each of the four options given in the test set, and predict the correct answer to be the option with the highest cosine similarity.
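To make the scoring step concrete, here is a minimal base-R sketch of picking the option with the highest cosine similarity. The function names (tokenize, cosine_similarity, pick_answer), the crude tokenizer and the example strings are all assumptions made up for illustration; a real attempt would more likely work on tm term vectors with stemming and TF-IDF weighting.

```r
# Split lower-cased text into word tokens (a deliberately crude tokenizer).
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

# Cosine similarity between the raw term-frequency vectors of two strings.
cosine_similarity <- function(a, b) {
  ta <- tokenize(a)
  tb <- tokenize(b)
  vocab <- union(ta, tb)
  va <- as.numeric(table(factor(ta, levels = vocab)))
  vb <- as.numeric(table(factor(tb, levels = vocab)))
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) return(0)  # at least one string has no tokens
  sum(va * vb) / denom
}

# Return the index of the option most similar to the predicted text.
pick_answer <- function(predicted_text, options) {
  scores <- vapply(options,
                   function(o) cosine_similarity(predicted_text, o),
                   numeric(1))
  unname(which.max(scores))
}

# Hypothetical predictor output and answer options (not from the Kaggle data).
predicted <- "plants use sunlight to make food through photosynthesis"
options <- c("respiration", "photosynthesis", "evaporation", "digestion")
best <- pick_answer(predicted, options)
```

Note that with raw term counts, short one-word options only score above zero when the predicted text contains that exact word, which is one reason stemming or a weighting scheme may be worth adding.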
I will train the model using the training data, a Wikipedia corpus and a few science textbooks so that the model does not overfit.
I have two questions here:
Does the overall approach make sense?
What would be a good starting point to build the text predictor? Will converting the corpus (training data, Wikipedia and textbooks) to a term-document or document-term matrix help? I think forming n-grams for all the sources would help, but I don't know what the next step would be, i.e. how the model would predict and belt out a string of text (of, say, size N) on reading a question.
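As one possible way to go from n-grams to generated text, the base-R sketch below counts bigrams over a corpus and then greedily appends the most frequent successor of the last word. This is only an illustration under stated assumptions: the helper names (build_bigrams, next_word, generate) and the three-sentence corpus are invented here, and a greedy bigram model is a very simple baseline rather than a recommended solution for the challenge.

```r
# Split lower-cased text into word tokens (a deliberately crude tokenizer).
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

# Count bigrams across a corpus; returns a table keyed by "word1 word2".
build_bigrams <- function(docs) {
  pairs <- unlist(lapply(docs, function(d) {
    w <- tokenize(d)
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
  }))
  table(pairs)
}

# Most frequent successor of a word, or NA if the word was never seen first.
next_word <- function(bigrams, word) {
  hits <- bigrams[startsWith(names(bigrams), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^\\S+ ", "", names(hits)[which.max(hits)])
}

# Greedily extend a seed word by up to n further words.
generate <- function(bigrams, seed, n) {
  out <- seed
  for (i in seq_len(n)) {
    nxt <- next_word(bigrams, out[length(out)])
    if (is.na(nxt)) break
    out <- c(out, nxt)
  }
  paste(out, collapse = " ")
}

# Hypothetical toy corpus standing in for the training data and textbooks.
corpus <- c("plants make food by photosynthesis",
            "plants make oxygen",
            "animals eat food")
bigrams <- build_bigrams(corpus)
generated <- generate(bigrams, "plants", 2)
```

In practice the seed would have to come from the question (for example its last content word), and the same counting idea extends to trigrams or higher-order n-grams built with RWeka's tokenizers.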
I have tried implementing part of the approach: finding the optimum number of topics and performing LDA on the training set. Here's the code:
    library(topicmodels)
    library(RTextTools)

    # Read the cleaned training set and make sure the text columns are character
    data <- read.delim("cleanset.txt", header = TRUE)
    data$question <- as.character(data$question)
    data$answera <- as.character(data$answera)
    data$answerb <- as.character(data$answerb)
    data$answerc <- as.character(data$answerc)
    data$answerd <- as.character(data$answerd)

    # Build a document-term matrix from the question and its four options
    matrix <- create_matrix(cbind(as.vector(data$question),
                                  as.vector(data$answera),
                                  as.vector(data$answerb),
                                  as.vector(data$answerc),
                                  as.vector(data$answerd)),
                            language = "english", removeNumbers = FALSE,
                            stemWords = TRUE, weighting = tm::weightTf)

    # Fit LDA for k = 2..25 topics and record each model's log-likelihood
    best.model <- lapply(seq(2, 25, by = 1), function(k) { LDA(matrix, k) })
    best.model.loglik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
    best.model.loglik.df <- data.frame(topics = c(2:25),
                                       ll = as.numeric(as.matrix(best.model.loglik)))

    # Show the number of topics with the highest log-likelihood
    best.model.loglik.df[which.max(best.model.loglik.df$ll), ]

    # Refit with 25 topics
    best.model.lda <- LDA(matrix, 25)
Any help is appreciated!