Using arXiv to write academic titles (and more!)
This project is intended as a joke, highlighting that it is possible to generate sensible-sounding (but entirely nonsensical) papers using arXiv. arXiv is a repository of free-to-access scientific literature, aimed largely at physics. Here, I use a neural network to predict the next character in a sentence after giving it a prompt for a title.
Examples include:
- spectroscopy of dielectric precision mechanism of the miximitationcicueration octahedral tilting
- calaculations of the tetragonal perovskite oxide
- calaculations of novel for higher nickel curee charge cluster growth
- understanding in frustration of lead-halide perovskitescompensated vacancy migration of ch3nh3pbi3 perovskite
- smells arising from solar ferrimentional dynamics in a likitando octahnique-induced perovskite
- electron microscopy of cu3-a cetationionic thermodynamical dynamics in equivolarity control experiment
These are all fantastically nonsensical, which I enjoy.
How it works:
First, I use a JSON file, provided by Kaggle, that contains all the metadata of arXiv. The code then selects every paper whose title includes "erovskite" (dropping the first letter matches both "Perovskite" and "perovskite"). This reduces the size of the data set, and the filter could be anything that interests you. I have picked it to target papers on perovskites, a class of materials widely studied for solar cells.
stream=OpenRead@"arxiv-metadata-oai-snapshot.json";
(*use a dynamic array to capture the results incrementally*)
res=CreateDataStructure@"DynamicArray";
Monitor[lineNumber=0;
While[(line=ReadLine@stream)=!=EndOfFile,
(*parse each line once and reuse the result*)
record=ImportString[line,"RawJSON"];
title=record["title"];
If[StringContainsQ[title,"erovskite"],
res["Append",{record["id"],title}]];
lineNumber++],lineNumber];
Close[stream];
(*now convert the dynamic array into a list*)
res=Normal@res;
Export["training_titles.csv", res];
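Each line of the snapshot is a standalone JSON object, which is why the loop above can parse one line at a time. A minimal sketch of that step on a single record follows; the id and title are invented for illustration, and real records carry more fields than these two.

```wolfram
(* hypothetical record, invented for illustration; not taken from the real snapshot *)
sampleLine = "{\"id\": \"0000.00000\", \"title\": \"A made-up perovskite title\"}";

(* "RawJSON" turns the JSON object into an Association *)
record = ImportString[sampleLine, "RawJSON"];

(* same filter as the loop above *)
If[StringContainsQ[record["title"], "erovskite"],
 {record["id"], record["title"]},
 Missing["NoMatch"]]
```

Swapping "erovskite" for another substring is all it takes to build a corpus on a different topic.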
Now I format this training data (it is exported above and imported here so that I can skip directly to this step in the future):
corpus = Import["training_titles.csv", "Data"];
corpus = StringJoin[corpus[[;; , 2]]];
corpus = ToLowerCase[corpus];
corpus=StringReplace[corpus,"\n"->" "];
keep =Join[Alphabet[],{" ","-","0","1","2","3","4","5","6","7","8","9"}];
corpus=StringDelete[corpus,Except[keep]];
sampleCorpus[corpus_,len_,num_]:=Module[{positions=RandomInteger[{len+1,StringLength[corpus]-1},num]},Thread[StringTake[corpus,{#-len,#}&/@positions]->StringPart[corpus,positions+1]]];
trainingData=sampleCorpus[corpus,10,200000];
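To make the sampling step concrete, here is a minimal sketch of how a single training pair is cut out of the text. The toy string, window length and position are chosen by hand for illustration; sampleCorpus does the same thing at random positions.

```wolfram
toy = "perovskite solar cells";  (* made-up corpus *)
len = 4;
pos = 9;  (* position 9 is the "t" in "perovskite" *)

(* the window runs from pos-len to pos, so it holds len+1 characters;
   the target is the single character that follows it *)
pair = StringTake[toy, {pos - len, pos}] -> StringPart[toy, pos + 1]
(* "vskit" -> "e" *)
```

Because the window includes both end points, a call with len set to 10 produces 11-character inputs.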
Now, if we look at the training data, each entry simply pairs a short window of text with the letter that follows it, e.g.
RandomSample[trainingData,5]//Map[InputForm]//TableForm
will give the output
"ovskite cob" -> "a"
" initio cal" -> "c"
"reshigh pre" -> "s"
"e oxide com" -> "p"
"vskite via" -> " "
I now use a neural network to create a prediction algorithm.
characters = Union[Characters[corpus]]
n = Length[characters];
predict=NetChain[{UnitVectorLayer[n],GatedRecurrentLayer[128],GatedRecurrentLayer[128],NetMapOperator[LinearLayer[n]],SoftmaxLayer[]}]
teacherForcingNet=NetGraph[<|"predict"->predict,"rest"->SequenceRestLayer[],"most"->SequenceMostLayer[],"loss"->CrossEntropyLossLayer["Index"]|>,{NetPort["Input"]->"most"->"predict"->NetPort["loss","Input"],NetPort["Input"]->"rest"->NetPort["loss","Target"]},"Input"->NetEncoder[{"Characters",characters}]]
result=NetTrain[teacherForcingNet,<|"Input"->Keys[trainingData]|>,All,BatchSize->64,MaxTrainingRounds->5,TargetDevice->"CPU",ValidationSet->Scaled[0.1]]
trained=NetExtract[result["TrainedNet"],"predict"]
predictNet=NetJoin[NetTake[trained,3,"Input"->Automatic],{SequenceLastLayer[],NetExtract[trained,{4,"Net"}],SoftmaxLayer[]},"Input"->NetEncoder[{"Characters",characters}],"Output"->NetDecoder[{"Class",characters}]]
obj = NetStateObject[predictNet]
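The training graph above uses teacher forcing: SequenceMostLayer feeds every character except the last into the predictor, and SequenceRestLayer supplies every character except the first as the target, so each position learns to predict its successor. The built-in list functions Most and Rest show the same alignment on a short made-up string. This is only an illustration and is not part of the training code.

```wolfram
chars = Characters["oxide"];  (* made-up example string *)

(* inputs (all but the last character) paired with targets (all but the first) *)
aligned = Thread[Most[chars] -> Rest[chars]]
(* {"o" -> "x", "x" -> "i", "i" -> "d", "d" -> "e"} *)
```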
So, if I ask it to predict the next letter after perovskit, we should expect it to output e, to finish the word perovskite. And so it does:
In: predictNet["perovskit"]
Out: e
Now, I can finish off sentences using the trained network:
StringJoin@NestList[obj[#,"RandomSample"]&,"vibrational coherence ",100]
Out: "ion diffusion and field thermal effect in the band gaps layered perovskites".
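The "RandomSample" argument is what makes each generated title different: instead of always taking the most likely character, the next character is drawn at random in proportion to the predicted probabilities. A minimal sketch of that idea follows. The three characters and their weights are invented for illustration; the real network outputs one probability for every entry in characters.

```wolfram
(* hypothetical next-character probabilities after "perovskit" *)
probs = <|"e" -> 0.9, "i" -> 0.07, "a" -> 0.03|>;

(* most likely character, which is what a plain call such as predictNet["perovskit"] reports *)
mostLikely = First[Keys[TakeLargest[probs, 1]]]

(* weighted random draw, analogous to asking for "RandomSample" *)
drawn = RandomChoice[Values[probs] -> Keys[probs]]
```

Sampling rather than always picking the top character is what lets the generator wander into invented words like those in the examples at the top.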
Full git here.