First, let me introduce XiKc, the news website I am currently building, at http://s.xikc.com. The URL has no particular meaning. It is just a domain name I happened to register a year ago, picked only because it is short, with just four letters.
The website is experimental in nature. I mainly want to try two ideas. The first is how to make a website update a large amount of content quickly and automatically, because Google is sensitive to how fast a site is updated, and a site that updates quickly can get better rankings. The second is that I have always wanted to build a geography-related news website, because the same news matters differently to people in different places. For example, if a small supermarket opens in the district where you live, that may be very important within the district, while for anyone else the news is meaningless. So a news search should give news different weights for different places and different users.
Only the first idea is implemented so far. I collected about ten thousand RSS feeds online, including news, blogs, and so on. I then wrote an RSS parser (all the programs on the site are written in PHP) that automatically reads these feeds and, if there is an update, saves it in the database. Updates did turn out to be very fast: in about two weeks, roughly two hundred and fifty thousand news items were read from these ten thousand feeds. Of course, each item is shown with a link to the original and with its source, so I think there should be no copyright problem; after all, the point of offering RSS is to make reposting easier. The process was also interrupted for adjustments during these two weeks.
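The read-and-save step can be sketched roughly as follows. This is only an illustration, not the site's actual code: the site is written in PHP, while this sketch uses Python with the standard library, and the table layout, the function name `save_new_items`, and the sample feed are all assumptions. It treats an item's link as its identity, so an item that was already stored is skipped on the next read.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A tiny made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def save_new_items(conn, feed_url, rss_xml):
    """Parse RSS text and store items not seen before; return how many were new."""
    # Hypothetical schema: the link is the primary key, so duplicates are rejected.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "link TEXT PRIMARY KEY, title TEXT, source TEXT)"
    )
    root = ET.fromstring(rss_xml)
    new_count = 0
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        if not link:
            continue  # without a link there is nothing to identify the item by
        # INSERT OR IGNORE leaves already-stored rows untouched,
        # so only genuinely new items are counted.
        cur = conn.execute(
            "INSERT OR IGNORE INTO items (link, title, source) VALUES (?, ?, ?)",
            (link, title, feed_url),
        )
        new_count += cur.rowcount
    conn.commit()
    return new_count
```

Running this over every feed on a schedule would give the automatic updating described above. Keeping the feed URL in the `source` column is what makes it possible to show where each item came from.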
While writing the code, one of the most interesting questions was how to determine which news items are related. Of course, if every article had tags, searching by tag would give good results. The problem is that most articles have none. I also considered comparing the similarity of the full text, which Google and Baidu may do, but for a small personal website such heavy computation is obviously not realistic; it would probably make my Dreamhost hosting hang, ha ha. So I have to use the similarity of titles. A Chinese title is not separated into words the way an English one is, so the title first has to be segmented into words, and then prepositions, quantifiers, and other useless words are filtered out. Even so, matching on the words taken from the title is not very accurate (but for now this is the method I have to use).
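A minimal sketch of this title matching might look like the following. It is written in Python for illustration only (the site itself is in PHP). The stop-word list is a tiny made-up sample, and the Jaccard overlap of word sets is my own choice of measure, since the post does not say how the comparison is scored. The `segment` parameter is where a Chinese word segmenter would be plugged in; splitting on whitespace is used here only so the example runs on English titles.

```python
# Hypothetical stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "for", "and"}


def title_words(title, segment=str.split):
    """Segment a title into lowercase words and drop the stop words."""
    return {w.lower() for w in segment(title)} - STOP_WORDS


def title_similarity(a, b, segment=str.split):
    """Jaccard similarity of the two titles' word sets, between 0.0 and 1.0."""
    wa, wb = title_words(a, segment), title_words(b, segment)
    if not wa or not wb:
        return 0.0  # nothing meaningful left to compare
    return len(wa & wb) / len(wa | wb)


def related(title, candidates, threshold=0.3, segment=str.split):
    """Return candidate titles at or above the threshold, most similar first."""
    scored = [
        (title_similarity(title, c, segment), c)
        for c in candidates
        if c != title
    ]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]
```

For example, `related("Supermarket opens in the district", titles)` would keep a title such as "New supermarket opens in district" and drop unrelated ones. The threshold of 0.3 is arbitrary and would need tuning; because titles are short, a single shared word can swing the score a lot, which matches the low accuracy mentioned above.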
There is also the sorting of news and hot topics. Sorting news is relatively simple, because you can increase the weight of time so that the latest news always stays at the top, and that is the method I use now. Hot topics or hot tags should require more complex algorithms. I wrote a program for this, but it ran for more than 10 minutes, so for now I do it by hand and will think about it later; after all, this