Jason Schwarzenberger
|
9bfc6fc6fa
|
scraper settings, ordering and loop.
|
2020-11-04 15:47:12 +13:00 |
|
Jason Schwarzenberger
|
6ea9844d00
|
remove useless try blocks.
|
2020-11-04 15:37:19 +13:00 |
|
Jason Schwarzenberger
|
1318259d3d
|
imply referrer is substack.
|
2020-11-04 15:21:07 +13:00 |
|
Jason Schwarzenberger
|
98a0c2257c
|
increase declutter timeout.
|
2020-11-04 15:15:00 +13:00 |
|
Jason Schwarzenberger
|
e6976db25d
|
fix tabs
|
2020-11-04 15:04:20 +13:00 |
|
Jason Schwarzenberger
|
9edc8b7cca
|
move scraping for article content to files.
|
2020-11-04 15:00:58 +13:00 |
|
Jason Schwarzenberger
|
d718d05a04
|
fix dates for newsroom.
|
2020-11-04 11:53:16 +13:00 |
|
Jason Schwarzenberger
|
9f4ff4acf0
|
remove unnecessary sitemap.xml request.
|
2020-11-04 11:22:15 +13:00 |
|
Jason Schwarzenberger
|
db6aad84ec
|
fix mistake.
|
2020-11-04 11:12:01 +13:00 |
|
Jason Schwarzenberger
|
29f8a8b8cc
|
add news site categories feed.
|
2020-11-04 11:08:50 +13:00 |
|
|
9a279d44b1
|
Add header to get content type
|
2020-11-03 20:27:43 +00:00 |
|
Jason
|
abf8589e02
|
fix sitemap
|
2020-11-03 10:53:40 +00:00 |
|
Jason
|
b759f46582
|
use extruct for opengraph/json-ld/microdata of articles
|
2020-11-03 10:31:36 +00:00 |
|
Jason Schwarzenberger
|
736cdc8576
|
fix mistake.
|
2020-11-03 17:04:46 +13:00 |
|
Jason Schwarzenberger
|
244d416f6e
|
settings config of sitemap/substack publications.
|
2020-11-03 17:01:29 +13:00 |
|
Jason Schwarzenberger
|
5f98a2e76a
|
Merge remote-tracking branch 'tanner/master' into master
And adding relevant settings.py.example/etc.
|
2020-11-03 16:44:02 +13:00 |
|
Jason Schwarzenberger
|
76f1d57702
|
sitemap based feed.
|
2020-11-03 16:00:03 +13:00 |
|
Jason Schwarzenberger
|
4e64cf682a
|
add the bulletin.
|
2020-11-03 12:41:16 +13:00 |
|
Jason Schwarzenberger
|
c5fe5d25a0
|
add substack.py top sites, replacing webworm.py
|
2020-11-03 12:28:39 +13:00 |
|
Jason
|
283a2b1545
|
fix webworm comments
|
2020-11-02 22:06:43 +00:00 |
|
Jason Schwarzenberger
|
0d6a86ace2
|
fix webworm dates.
|
2020-11-03 10:31:14 +13:00 |
|
Jason Schwarzenberger
|
f23bf628e0
|
add webworm/substack as a feed.
|
2020-11-02 17:09:59 +13:00 |
|
|
ca78a6d7a9
|
Move feed and PRAW config to settings.py
|
2020-11-02 02:26:54 +00:00 |
|
|
e59acefda9
|
Remove Whoosh
|
2020-11-02 00:22:40 +00:00 |
|
|
cbc802b7e9
|
Try Hackernews API twice
|
2020-11-02 00:17:22 +00:00 |
|
|
4579dfce00
|
Improve logging
|
2020-11-02 00:13:43 +00:00 |
|
|
feba8b7aa0
|
Make qotnews work with WaPo
|
2020-10-29 04:55:34 +00:00 |
|
|
992c1c1233
|
Monkeypatch earlier
|
2020-10-24 22:30:00 +00:00 |
|
|
88d2216627
|
Add a script to delete a story
|
2020-10-03 23:42:21 +00:00 |
|
|
6cf2f01b08
|
Adjust feeds
|
2020-10-03 23:41:57 +00:00 |
|
|
6576eb1bac
|
Adjust content-type request timeout
|
2020-08-14 03:57:43 +00:00 |
|
|
472af76d1a
|
Adjust port
|
2020-08-14 03:57:18 +00:00 |
|
|
4727d34eb6
|
Delete displayed-attributes when init search
|
2020-08-14 03:56:47 +00:00 |
|
|
0e086b60b8
|
Remove business subreddit from feed
|
2020-08-14 03:55:28 +00:00 |
|
|
b46ce36c63
|
Update requirements
|
2020-07-08 05:24:32 +00:00 |
|
|
9a449bf3ca
|
Remove extra logging
|
2020-07-08 02:36:40 +00:00 |
|
|
0bd9f05250
|
Fix crash when HN feed fails
|
2020-07-08 02:36:40 +00:00 |
|
|
9c116bde4a
|
Remove document img and ignore r/technology
|
2020-07-08 02:36:40 +00:00 |
|
|
ebedaef00b
|
Tune search rankings and attributes
|
2020-07-08 02:36:40 +00:00 |
|
|
d7f0643bd7
|
Add more logging
|
2020-07-08 02:36:40 +00:00 |
|
|
f1c846acd0
|
Remove get first image
|
2020-07-08 02:36:40 +00:00 |
|
|
850b30e353
|
Add requests timeouts and temporary logging
|
2020-07-08 02:36:40 +00:00 |
|
|
d614ad0743
|
Integrate with external MeiliSearch server
|
2020-07-08 02:36:40 +00:00 |
|
|
f46cafdc90
|
Integrate sqlite database with server
|
2020-07-08 02:36:40 +00:00 |
|
|
873dc44cb1
|
Update whoosh migration script
|
2020-07-08 02:36:40 +00:00 |
|
|
1fb9db3f4b
|
Store ref list in database too
|
2020-07-08 02:36:40 +00:00 |
|
|
b923908a45
|
Begin initial sqlite conversion
|
2020-07-08 02:36:40 +00:00 |
|
|
dbdcfaa921
|
Check if cache is broken
|
2020-07-08 02:36:40 +00:00 |
|
|
8799b10525
|
Fall back to ref on manual submission title
|
2020-07-08 02:36:40 +00:00 |
|
|
6430fe5e9f
|
Check content-type
|
2020-07-08 02:36:40 +00:00 |
|
|
a4cf719cb8
|
Remove technology subreddit
|
2020-07-08 02:36:40 +00:00 |
|
|
595f469b4a
|
Update tildes parser group tag
|
2020-07-08 02:36:40 +00:00 |
|
|
7b31fcf690
|
Remove keys of uncached stories
|
2020-01-28 04:20:05 +00:00 |
|
|
b3d2eeb67f
|
Fix tildes deleted comment parser error
|
2020-01-28 04:19:26 +00:00 |
|
|
9078b567f0
|
Add del tag and sort tags
|
2020-01-04 23:37:41 +00:00 |
|
|
2822974b6e
|
Stop using archive.is on articles (hits CAPTCHAs)
|
2019-12-15 22:47:33 +00:00 |
|
|
17ef7e3a65
|
Whitelist more html tags
|
2019-12-14 07:39:10 +00:00 |
|
|
2d80b19414
|
Grab comments on manually submitted links
|
2019-12-02 23:15:51 +00:00 |
|
|
ebcbf1b624
|
Sanitize html
|
2019-12-01 22:18:41 +00:00 |
|
|
e231cd5c31
|
Decrease feed cache length to 150
|
2019-12-01 22:18:14 +00:00 |
|
|
db5097ac57
|
Drop articles more than two days old
|
2019-11-08 21:50:33 +00:00 |
|
|
2edb3ceba7
|
Allow manual submission of articles
|
2019-11-08 05:55:30 +00:00 |
|
|
38b5f2dbeb
|
Move to gevent production http server
|
2019-11-08 02:37:57 +00:00 |
|
|
6826f731c7
|
Handle hostnames better
|
2019-11-07 22:10:08 +00:00 |
|
|
bb693ba434
|
Add subreddit
|
2019-11-07 22:09:45 +00:00 |
|
|
9e55f6e4ec
|
Fix Tildes down for maintenance edge case
|
2019-10-22 05:01:30 +00:00 |
|
|
edc4c439d7
|
Prefetch first images
|
2019-10-19 07:33:06 +00:00 |
|
|
f8998b687e
|
Fix crash from domain and ext check bug
|
2019-10-16 08:56:31 +00:00 |
|
|
e4f81472fc
|
Fix copy/paste error, switch to info logging
|
2019-10-16 05:26:47 +00:00 |
|
|
f293f2b5f9
|
Begin README and add license
|
2019-10-15 16:40:55 -06:00 |
|
|
810e8c5ead
|
Archive WSJ articles first, catch KeyboardInterrupt
|
2019-10-15 21:03:47 +00:00 |
|
|
9c4766a928
|
Stop using python keyword id for id
|
2019-10-15 20:36:20 +00:00 |
|
|
0f5b2a5ff9
|
Cache all articles in IndexedDB
|
2019-10-12 23:41:31 +00:00 |
|
|
7cb87b59fe
|
Move archive to Whoosh and add search
|
2019-10-12 05:32:17 +00:00 |
|
|
45b75b420b
|
Gitkeep archive directory
|
2019-10-10 21:55:21 +00:00 |
|
|
f0721519e1
|
Serve client through apiserver, adding meta info
|
2019-10-10 21:54:29 +00:00 |
|
|
5fd4fdb21c
|
Fix Tildes comments with unknown authors
|
2019-10-08 08:01:17 +00:00 |
|
|
19e9a80be1
|
Archive Bloomberg articles first
|
2019-10-08 08:00:50 +00:00 |
|
|
5caa4542d8
|
Gitkeep apiserver data directory
|
2019-10-08 07:59:30 +00:00 |
|
|
0053147226
|
Ignore certain files and domains, remove refs
|
2019-09-24 08:22:06 +00:00 |
|
|
0496fbba45
|
Ignore new Tildes posts and handle deleted ones
|
2019-09-24 08:21:26 +00:00 |
|
|
0a1ebaa8b8
|
Handle Reddit PRAW exceptions
|
2019-09-24 08:20:46 +00:00 |
|
|
2ede5ed6ff
|
Filter out False comments
|
2019-08-30 06:23:14 +00:00 |
|
|
23cdbc9292
|
Render reddit markdown, poll tildes better, add utils
|
2019-08-28 04:13:02 +00:00 |
|
|
fc8ce79e33
|
Try outline.com for reader mode first
|
2019-08-25 23:49:08 +00:00 |
|
|
cf9e197e6c
|
Fix tildes comments parsing bug
|
2019-08-25 07:46:22 +00:00 |
|
|
1b6c8fc6cb
|
Add tildes to feeds
|
2019-08-25 00:36:26 +00:00 |
|
|
a2509958da
|
Add reddit to feeds
|
2019-08-24 21:37:43 +00:00 |
|
|
d341d4422f
|
Abstract api server feeds
|
2019-08-24 08:49:11 +00:00 |
|
|
c1a81a4d8c
|
Write news stories to disk
|
2019-08-24 05:07:16 +00:00 |
|
|
62d68da415
|
Finish prototype api server
|
2019-08-23 08:23:48 +00:00 |
|
|
c04b5c27f2
|
Figure out .gitignores
|
2019-08-23 08:23:26 +00:00 |
|