I recently implemented search functionality for my Hugo site, which can be seen at: https://imtorgdemo.github.io/pages/search/. The search uses lunr.js, a small client-side search library inspired by Solr. While it works sufficiently, the metadata used for ranking queries could be improved. It would also be nice to visually locate results by where each post fits among the three fundamental data science disciplines: mathematics, computer science, and business. This narrative provides a quick solution for ranking posts by each discipline, then reducing the three dimensions to a position in the xy-plane.
Environment
Let's set up the environment for the basic scientific and NLP work.
import numpy as np
import nltk
import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp("This is a sentence.")
len(doc.vector)
300
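Note that a spaCy document vector is simply the average of its token vectors; this is what the similarity comparisons later in this post operate on. A quick check:
# a doc's vector is the mean of its token vectors (en_core_web_lg)
np.allclose(doc.vector, np.mean([t.vector for t in doc], axis=0))   # expect: True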
import os
import sys
os.chdir('./Data/markdown')
os.listdir()
['blog_test-my_first_post.md', 'blog-logic_for_math.md']
os.chdir('/home/jovyan/PERSONAL/')
This is the basic metadata for each post.
! head Data/markdown/blog-logic_for_math.md
+++
title = "Building Math from the Ground-Up"
date = "2019-07-05"
author = "Jason Beach"
categories = ["Mathematics", "Logic"]
tags = ["nlp", "tag2", "tag3"]
+++
Preprocessing
We will use a post that focuses primarily on mathematics. So, we expect the ranking results to align with mathematics more than the other two fields.
file_path = "/home/jovyan/PERSONAL/Data/markdown/blog-logic_for_math.md"
with open(file_path, 'r') as file:
    lines = file.readlines()
metadata = lines[0:8]
content = ' '.join(lines[8:])
content = content.replace("\n","")
import re, string
# strip punctuation and underscores, then collapse repeated whitespace
pattern = re.compile(r'([^\s\w]|_)+')
content = pattern.sub('', content)
content = ' '.join(content.split())
Get word associations from the website. This is performed manually. In the future, scrape the site and get many more associations.
#TODO: scrape the website
import requests
url = 'https://wordassociations.net/en/words-associated-with/TARGET?button=Search'
url = url.replace('TARGET', 'computer')
resp = requests.get(url)
noun_loc = resp.text.find('Noun')
adj_loc = resp.text.find('Adjective')
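The rest of the scrape is left as a rough sketch; the slice between the 'Noun' and 'Adjective' headings and the tag-delimited word pattern are assumptions about the page markup, not something verified against the live site:
# hypothetical parsing: take the text between the 'Noun' and 'Adjective'
# headings and pull out capitalized words from the (assumed) markup
section = resp.text[noun_loc:adj_loc]
scraped_words = re.findall(r'>([A-Z][a-z]+)<', section)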
#search: business
word_assoc_biz = 'MbaEntrepreneurshipRetailEntrepreneurConsultancyStartupAccountingSectorMarketingBankingCateringGroceryInvestingWhartonEnterpriseLendingStakeholderCustomerEconomicFinanceConsumerCommerceConglomerateEconomicsBakeryInvestmentInsuranceManagementSupplierMarketplaceInvestorVentureFirmTelecomTradesmanPayrollManufacturingBrokerTransactionLumberRetailerProfitFinancingContractingSustainabilityPartnerInnovationHospitalityNetworkingAccountantExecutiveIncentivePartnershipProcurementShareholderEmployeeAssetUndergraduateIndustrialPhilanthropyEquityLiabilitySmallAdvertisingInformaticsSalesTourismRecessionLeisurePurchasingConsultantOwnerHaasAccreditationMercantileWholesaleProfitableLucrativeUnfinishedRetailThrivingRiskyConsultingCorporateMultinationalBoomingNonprofitGraduateAccreditedPhilanthropicFinancialSustainableAutomotiveBankruptUrgentDiversifyDivestInvestProsperRestructure'
tmp = re.findall('[A-Z][^A-Z]*', word_assoc_biz)
words_biz = ' '.join(tmp)
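The findall pattern splits the pasted CamelCase run at each capital letter; a quick demonstration on a short slice of the string above:
re.findall('[A-Z][^A-Z]*', 'MbaEntrepreneurshipRetail')
# ['Mba', 'Entrepreneurship', 'Retail']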
#search: software, programming
word_assoc_cs = 'SimulcastOptimizationDualityIntegerSynthesizerApiSynthNewscastKeyboardCwProgrammerCompilerPythonBasicKeywordAiringAffiliateSyndicationDecompositionAlgorithmJavaUhfUnixSemanticsParadigmInterfaceRecourseConstraintArrangerAutomatonFccPascalSyntaxBroadcastingBroadcastCbcNickelodeonApproximationNetworkAffiliationPbsSemanticRelaxationInstrumentationLineupHdBrandingLanguagePercussionEmmyLogicFmIdeAbcChannelTelecastDrumCableForthBbQuadraticStochasticNonlinearWeekdayFractionalOrientedLinearConvexJavaOptimalDynamicSequentialDaytimeScriptedConcurrentConstrainedPrimalConcaveImperativeGraphicalRetroProceduralAiredObjectiveFuzzyAnalogPolynomialSyndicateAirNetworkProgrammeBroadcastGeneralizeStructureRelaunchMixnuHardwareLinuxCadPackageDeveloperMacintoshVendorWorkstationUnixAutomationProgrammerAmigaEncryptionFunctionalityServerVisualizationBrowserIbmAdobeCompatibilityComputerOsComputingAtariPcApiGuiInterfaceUserGraphicsNetworkingLicenseSimulationRepositoryModelingXpMidiModemUpdateVerificationRouterCompilerToolProcessorEditingSimulatorMultimediaApplicationProviderLaptopUpgradeCpuNokia3dMetadataMicroprocessorApacheStartupPiracyAppIntelValidationSuiteOptimizationCiscoKernelModellingDocumentationGraphicHackerImplementationConsultancyVulnerabilityTcpEmailGps'
tmp = re.findall('[A-Z][^A-Z]*', word_assoc_cs)
words_cs = ' '.join(tmp)
#search: mathematics
word_assoc_math = 'PhysicAlgebraMathematicCalculusMathematicianPhysicsGeometryOlympiadAstronomyHilbertTopologyGraderPolynomialManifoldProficiencyInformaticsBscTheoremAxiomEulerMathMechanicPhdGeneralizationIntegralAptitudeProfessorshipGenealogyBsChemistryTextbookStatisticEmeritusComputationSpringerLogicDescartesDoctorateScienceMechanicsCurriculumMultiplicationBachelorProfessorUndergraduateSubgroupIntegerConjectureNeumannGraduateComputingLecturerSummaBiologySubsetAstrologyFourierExamPedagogyCantorTensorPhilosophyCalculatorPermutationMatriceAlgebraicMathematicalTopologicalProjectiveProficientEuclideanArithmeticAppliedDifferentialManifoldDiscreteOrthogonalComputationalAnalyticPolynomialInvariantNumericalGeometricFiniteQuadraticStochasticGradeGeometricalStudiedSymmetricBabylonianTheoreticalDegreeAbstractTextbookEmeritusGraduate'
tmp = re.findall('[A-Z][^A-Z]*', word_assoc_math)
words_math = ' '.join(tmp)
path = './Data/markdown/'
file = path + 'word_assocation_ref.json'
words = {"math": words_math, "cs": words_cs, "biz": words_biz}
import json
with open(file, 'w') as fp:
    json.dump(words, fp)
with open(file, 'r') as fp:
    new_words = json.load(fp)
math = nlp(words_math)
cs = nlp(words_cs)
biz = nlp(words_biz)
Similarity
These results use the cosine similarity between each field's word associations and all of the word embeddings of terms in the document. They are not what we expect to see: math is ranked lowest, despite being the primary subject of the document.
We can probably do better by removing unnecessary stop words and taking the most 'important' words in the document. The most important terms can be defined using the TF-IDF formula that is typical in 'bag-of-words' NLP approaches.
We will use the sklearn library for the simple calculations.
# Compare two documents
doc1 = nlp(content)
print(doc1.similarity(biz))
print(doc1.similarity(cs))
print(doc1.similarity(math))
0.5729121134078637
0.6058354478834562
0.48928262314068244
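As a sanity check, spaCy's similarity() is just the cosine between the two documents' averaged word vectors, so we can reproduce one of the numbers above directly with numpy:
# reproduce doc1.similarity(math) as a plain cosine similarity
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosine(doc1.vector, math.vector))   # expect the same value as above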
from sklearn.feature_extraction.text import TfidfVectorizer
# fitting on content.split(' ') treats each word as its own 'document', so
# min_df=3 keeps only words that appear at least three times in the post
tfidf = TfidfVectorizer(min_df=3, analyzer='word', stop_words='english', sublinear_tf=True)
tfidf.fit(content.split(' '))
feature_names = tfidf.get_feature_names()
def get_tfidf_for_words(text):
    tfidf_matrix = tfidf.transform([text]).todense()
    feature_index = tfidf_matrix[0,:].nonzero()[1]
    tfidf_scores = zip([feature_names[i] for i in feature_index], [tfidf_matrix[0, x] for x in feature_index])
    return dict(tfidf_scores)

scores = get_tfidf_for_words(content)
sorted_scores = {k: v for k, v in sorted(scores.items(), key=lambda item: item[1], reverse=True)}
from itertools import islice
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
The results of choosing important words by TF-IDF look much more promising. These are words you would expect to be associated with formal mathematics.
take(10, sorted_scores.items())
[('logic', 0.14527847977675443),
('logical', 0.140443796477262),
('form', 0.13793934097017202),
('predicate', 0.13687559493415327),
('use', 0.1342685066335781),
('language', 0.13265636975548645),
('symbols', 0.13265636975548645),
('truth', 0.13265636975548645),
('argument', 0.13077412529085744),
('inference', 0.13077412529085744)]
There are 82 words that are most important in describing the document.
len(list(sorted_scores.keys()))
82
When we compare the document's most important words against the fields' associations, we find a much more compelling story. Now, Math is ranked highest. Computer Science is not far behind, but Business is rightfully quite different.
# Compare documents
words = ' '.join(list(sorted_scores.keys()))
doc1 = nlp(words)
print(doc1.similarity(biz))
print(doc1.similarity(cs))
print(doc1.similarity(math))
0.5238546677925497
0.6606328549347019
0.6888387753219556
Visual Location
We want to visually locate the document within an SVG image, shown below. This is quite unintuitive because there are three axes within a 2D plane. We must reduce the three dimensions to two.
There is no single correct answer to this. In fact, there are approaches we could have taken earlier that would have handled this for us; a supervised clustering approach could have enforced the three groups.
But we are keeping this simple and fast: no modeling. Instead of justifying a best solution, let us find the simplest method to reduce dimensions that is NOT incorrect. We can make the following assumptions:
- While the similarity() method returns cosine similarity with a range of 0 (no similarity) to 1 (perfect similarity), the similarity in writing style leads us to expect an actual range of .50-.70
- x-dimension: Computer Science and Mathematics are antagonistic to each other (from a technical-field perspective) but lie on a continuous scale between the two, so the two scores should be subtracted
- y-dimension: Business is discrete: either it is addressed in the text, or it is not
We can use the generalized logistic function on the difference between the Math and Computer Science scores, and say that 0 means a completely CS paper while 1 means a completely Math paper. We use an arbitrary B=25 to ensure a steep transition between the two.
# generalized logistic: Y = A + (K-A) / (C + Q*np.exp(-B*t))**(1/v)
# here A=0 and B is set, with all other parameters set to 1
def general_logistic(B, cs, math):
    t = math - cs
    return 1 / (1 + np.exp(-B*t))
xPt = general_logistic(25, .660, .688)
print( xPt )
0.6681877721681656
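A quick arithmetic check of that output: t = .688 - .660 = .028, so B*t = 0.7, and the logistic gives
print(1 / (1 + np.exp(-0.7)))   # ≈ 0.668, matching xPt above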
Because the actual range is closer to .50-.70, we can treat .60 as the decisive line, with values greater than .60 indicating applicability. So, the y-axis will run from 0 at the top to 1 at the bottom, with the top of the Business region set at .60. A similarity value lower than this number means the paper is not in this set.
yPt = doc1.similarity(biz)
print( yPt )
0.5238546677925497
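Under the .60 cutoff assumed above, the membership test is a one-liner (the threshold is our own choice, not anything computed):
print(yPt >= 0.60)   # False: this post does not fall in the Business set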
Prototype
Let's complete the prototype with some frontend work using D3.js.
BizPt = doc1.similarity(biz)
CsPt = doc1.similarity(cs)
MathPt = doc1.similarity(math)
xPt = general_logistic(25, CsPt, MathPt)
yPt = BizPt
metadata.insert(4, f"location = [{xPt}, {yPt}]\n")
metadata
['+++\n',
'title = "Building Math from the Ground-Up"\n',
'date = "2019-07-05"\n',
'author = "Jason Beach"\n',
'location = [0.6693281622812353, 0.5238546677925497]',
'categories = ["Mathematics", "Logic"]\n',
'tags = ["nlp", "tag2", "tag3"]\n',
'+++\n',
'\n']
combined = ''.join(metadata) + content
file_path = "/home/jovyan/PERSONAL/Data/markdown/result.md"
with open(file_path, 'w') as file:
    file.write(combined)
Once the transformed data is exported to markdown files, it can be indexed by lunr.js. The results of a search query can then load both the post information and the location, which is used to place the post within the SVG.
from beakerx.object import beakerx
beakerx.point = {"xPt":xPt, "yPt":yPt}
%%javascript
require.config({
    paths: {
        d3: '//cdnjs.cloudflare.com/ajax/libs/d3/4.9.1/d3.min'
    }
});
<IPython.core.display.Javascript object>
%%javascript
beakerx.displayHTML(this, '<div id="fdg"></div>');
var point = beakerx.point;
require(['d3'], function (d3) {
    var width = 300,
        height = 200;
    var svg = d3.select("#fdg")
        .append("svg")
        .attr("width", width)
        .attr("height", height)
        .attr("transform", "translate(" + [100, 0] + ")");
    // prototype: the dot is hard-coded to the center of the svg; the final
    // version should scale point.xPt and point.yPt to svg coordinates
    var node = svg
        .append("circle")
        .attr("class", "dot")
        .attr("r", 10)
        .attr("cx", 150)
        .attr("cy", 100)
        .style("fill", "Blue");
});
<IPython.core.display.Javascript object>
Conclusion
The final result of the script and D3 can be seen at: https://imtorgdemo.github.io/pages/search/.