Didn’t have the link to hand, but a search turned this one up: https://reggiodigital.com/blog/nginx-rule-blocking-bad-bots/ It looks to be the same list, and you can see the ones I’ve added to the end of it.
I’m the administrator of kbin.life, a general purpose/tech orientated kbin instance.
Hmm, I took an original list and added to it. You got a website I can check? If so I’ll happily remove. I don’t mind slow web crawlers at all.
So on my mbin instance, it’s on cloudflare. So I filter the AS numbers there. Don’t even reach my server.
On the sites that aren’t behind cloudflare. Yep it’s on the nginx level. I did consider firewall level. Maybe just make a specific chain for it. But since I was blocking at the nginx level I just did it there for now. I mean it keeps them off the content, but yes it does tell them there’s a website there to leech if they change their tactics for example.
You need to block the whole ASN too. Those that are using chrome/firefox UAs change IP every 5 minutes from a random other one in their huuuuuge pools.
Yeah, I probably should look to see if there’s any good plugins that do this on some community submission basis. Because yes, it’s a pain to keep up with whatever trick they’re doing next.
And unlike web crawlers that generally check a url here and there, AI bots absolutely rip through your sites like something rabid.
If you’re running nginx I am using the following:
if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }
That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!
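If you want to sanity check a pattern like this before deploying it, here is a minimal sketch in Python. It assumes nginx's `~*` behaves like an unanchored, case-insensitive regex search, uses only a short subset of the list above, and the sample user agent strings are just illustrative examples, not taken from my logs.

```python
import re

# Short subset of the pattern above; the full list works the same way.
BLOCK_PATTERN = re.compile(
    r"ClaudeBot|Bytespider|python|Amazonbot|SemrushBot",
    re.IGNORECASE,  # mirrors the case-insensitive ~* operator
)

def is_blocked(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the user agent."""
    return BLOCK_PATTERN.search(user_agent) is not None

# Illustrative user agents (assumed examples).
samples = [
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "python-requests/2.31.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

for ua in samples:
    print("BLOCK" if is_blocked(ua) else "ALLOW", ua)
```

Because the match is an unanchored substring match, short entries can hit more than you expect, so it's worth running a day of real user agents through something like this first.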
I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):
AS45102 (Alibaba cloud) AS136907 (Huawei SG) AS132203 (Tencent) AS32934 (Facebook)
Since these guys run or have run bots that impersonate real browser agents.
There are various tools online to return prefix/ip lists for an autonomous system number.
I put both into a single file and include it into my web site config files.
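For the IP part, a rough sketch of how the include file could be generated. This is not my actual script: it assumes you already have the prefixes for an AS number as CIDR strings from whichever lookup tool you use, the function name is made up, and the ranges below are documentation-only examples rather than real prefixes for any of the AS numbers above. It merges adjacent ranges and prints nginx `deny` lines.

```python
import ipaddress

def build_deny_rules(prefixes):
    """Turn CIDR prefixes into nginx `deny` lines, merging adjacent/overlapping ranges."""
    v4, v6 = [], []
    for prefix in prefixes:
        net = ipaddress.ip_network(prefix.strip(), strict=False)
        # collapse_addresses() cannot mix IPv4 and IPv6, so keep them apart.
        (v4 if net.version == 4 else v6).append(net)
    merged = list(ipaddress.collapse_addresses(v4)) + list(ipaddress.collapse_addresses(v6))
    return [f"deny {net};" for net in merged]

# Documentation-only example ranges (assumed, not real AS data).
example_prefixes = [
    "192.0.2.0/25",
    "192.0.2.128/25",
    "198.51.100.0/24",
    "2001:db8::/32",
]

rules = build_deny_rules(example_prefixes)
print("\n".join(rules))
```

The output can be saved to a file and pulled in with `include` alongside the user agent rule. Merging the ranges first keeps the file a bit shorter when an AS announces lots of neighbouring prefixes.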
EDIT: Just to add, keeping on top of this is a full time job! EDIT 2: Removed Mojeek bot as it seems to be a normal web crawler.
The sun always shines on pc.
I think it’ll be a “we’ll see” situation. This was the main concern for y2k. And I don’t doubt there’s some stuff that was partially patched from y2k still around that is still using string dates.
But the vast majority of software now works with timestamps and of course some things will need work. But with y2k the vast majority of business software needed changing. I think in this case the vast majority will be working correctly already and it’ll be the job of developers (probably in a panic less than a year before as is the custom) to catch the few outliers and yes some will escape through the cracks. But that was also the case last time round too.
You’re right on every point. But, I’m not sure how that goes against what I said.
Most applications now use the epoch for date and time storage, and for the 2038 problem the issues will be down to making sure either time_t or 64bit long values (and matching storage) are used, which will be a much smaller change than was the case for y2k. Since more people also use libraries for date and time handling it’s also likely this will be handled.
Most databases have datetime types which again are almost certainly already ready for 2038.
I just don’t think the scale is going to be close to the same.
Not really processor based. The timestamp needs to be ulong (not advised, but a 32bit unsigned value is good for dates up to early 2106, though it cannot express dates before 1970) or llong (long long). I think the ulong route is a bad idea but I bet some people too lazy to change their database schema will just do this internally.
The type time_t in Linux is now 64bit regardless. So, compiling applications that used that will be fine. Of course it’s a problem if the database is storing 32bit signed integers. The type on the database can be changed too and this isn’t hard really.
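To put numbers on those limits, a quick sketch (the helper name is my own) that works out the last second each 32bit type can represent, counting from the 1970 epoch:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_representable(bits: int, signed: bool) -> datetime:
    """Latest UTC time a seconds-since-epoch counter of this width can hold."""
    max_seconds = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    return EPOCH + timedelta(seconds=max_seconds)

signed_32 = last_representable(32, signed=True)
unsigned_32 = last_representable(32, signed=False)

print("signed 32bit:  ", signed_32.isoformat())    # 2038-01-19T03:14:07+00:00
print("unsigned 32bit:", unsigned_32.isoformat())  # 2106-02-07T06:28:15+00:00
```

A signed 64bit value pushes the limit out by hundreds of billions of years, which is why moving the column type is the proper fix rather than going unsigned.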
As for the Y10K problem. It will almost entirely only be formatting problems I think. In the 80s and 90s, storage was at a premium, databases were generally much simpler and as such dates were very often stored as YYMMDD. There also wasn’t so much use of standard libraries. So this meant that to fix the Y2K problem required quite some work. In some cases there wasn’t time to make a proper solution. Where I was working there was a two step solution.
One team made the interim change to adjust where all dates were read and evaluate anything <30 (it wasn’t 30, it was another number but I forget which) to be 2000+number and anything else 1900+number. This meant the existing product would be fine for another 30 years or so.
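That interim fix is usually called windowing. A minimal sketch of the idea, not the code we actually had: the pivot of 30 is only a placeholder since I don’t remember the real number, and the function names are made up.

```python
from datetime import date

def expand_two_digit_year(yy: int, pivot: int = 30) -> int:
    """Map a two digit year to four digits using a fixed pivot (assumed 30 here)."""
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two digit year, got {yy}")
    # Below the pivot means 20xx, everything else stays 19xx.
    return 2000 + yy if yy < pivot else 1900 + yy

def parse_yymmdd(text: str, pivot: int = 30) -> date:
    """Parse a YYMMDD string as stored by the old system into a real date."""
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"expected YYMMDD, got {text!r}")
    year = expand_two_digit_year(int(text[0:2]), pivot)
    return date(year, int(text[2:4]), int(text[4:6]))

print(parse_yymmdd("991231"))  # 1999-12-31
print(parse_yymmdd("000101"))  # 2000-01-01
```

The stored data doesn’t change at all, only the places where dates are read, which is why it could be done quickly. The catch is that it just moves the deadline to the pivot year.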
The other team was writing the new version of the software, which used MSSQL server as a back-end, with proper datetime typed columns and worked properly with years before and after 2000.
I suspect this wasn’t unusual in terms of approach and most software is using some form of epoch datatype which should be fine in terms of storing, reading and writing dates beyond Y10K. But some hard-coded date format strings will need to be changed.
Source: I was there, 3000 years ago.
This is something I’ve done on my last 2 (Samsung) phones, and on this latest one I also turned off fast charging. On previous phones it was capped at 85%, current ones seem to have several options with the highest “saving” being to charge to 80%.
If I’m going out for the day and need the full charge I turn it off for the duration and if I need a fast charge I turn that back on.
By and large though most of the time I keep it off. Seems to make the batteries last a bit longer. Too early to tell on the current phone though. Only a year old. I generally keep phones for around 4 years.
I used to do the opposite with the old NiCad batteries phones had in the 90s. I’d carry a spare fully charged one, run the main one down to zero, swap them and then charge it to full. This made a huuuuge difference though.
It’s not how ActivityPub (at least Lemmy/*bin servers) works. There isn’t so far as I’ve ever seen an API that allows for this within ActivityPub (now specific to Lemmy/*bin implementations there’s the API the browser/apps use that must provide this, but that’s not ActivityPub). It actually looks to be cleverly designed to prevent it. It might look like backfilling is happening because old stuff appears, but there are reasons for this.
How it works from my experience (I did some work on the federation in kbin a year or so ago).
And so old posts and comments will begin to appear as activities linked to them happen. But there isn’t a method to ask for “all the posts in community X” using ActivityPub. I remember because I was specifically looking for this a year or so ago. It lets you see the parent object but not any children.
Maybe Mastodon etc does it different? No idea.
And all of this is moot because if I block a User Agent, or I block an AS number/IP block. They’re not getting anything either by ActivityPub or scraping unless they change User Agent, AS number, or both.
I don’t think they’re optimising much at all. I think it’s likely just a modified web crawler but without the kind of throttling normal search engine crawlers use. They’re following links recursively. Then probably some basic parsing or even parsing with AI to prepare the data to make another AI model.
But, they aren’t. They’re not after ActivityPub specifically. They’re scraping the whole internet, most of them using clear bot User Agents. So, I routinely block their bots because the AI ones are usually hitting you multiple times a second non-stop. If they started making fake ActivityPub nodes they would not be scraping as a bot, and they would want specifically fediverse data. Important to note here though, an ActivityPub node doesn’t “collect” data, it subscribes (to Mastodon users/hashtags or communities) and then gets new data delivered to it. So they wouldn’t get the old stuff.
Having said that, I’ve seen some obvious bots using genuine browser user agents on IP addresses from certain very large Chinese companies. For those I just blocked their whole AS number.
I mean personally I did block all the AI scrapers I could find on my instance, around a month or so ago. There were a lot, mostly unscrupulous, some big names included. Probably should look at the logs to see what’s new.
The amount of traffic was quite significant too. I have a theory that they expect legislation soon, so are not playing nice and slow like crawlers do, but are vacuuming as fast as they can.
But you’re right. Everyone would need to do it, to make a difference.
Yes, instead use rozzers, bacon or filth!
I don’t think it’s rose-tinted glasses really. I think it’s just the change in dynamic. It was definitely different during the “real” classic times (I would say classic to Wrath).
In 2005 when I started playing you needed to group up to get things done really. When you did this you met people. You talked, not with a microphone, but you would be talking. You’d get to know people, they’d invite you to dungeon groups and vice-versa, it would widen both of your in game circles and so on.
When I got to the position to raid, I was on an RP-PvP realm and while there were raiding guilds, many people were in smaller guilds that were either role-playing or guilds of friends. So, there were often raiding groups. I was in one of these, and we had our own guild chat-esque thing that everyone in the group could chat through and of course raids were mandatory voice. Because generally you did need to have communications to raid. This increased your in game circle too.
I still speak to some people now, on social media in various forms that I played the game with in 2005-2010. Some I met, others I never did. I’ve not really played retail much for a while now. But, it’s not the same. To an extent, neither is classic now.
Now, probably an unpopular opinion because I think a lot of people think Blizzard’s actions led to this change in community spirit. I actually think it’s the other way round. I think they saw their player-base changing, and adjusted the game to suit. The side effect is that it put off some of those with a more social gaming mindset for good. But, it would have happened anyway.
Times change, and they just rolled with it.
I did a routine upgrade on my mbin server, where I had an old version with changes I made myself.
Well turns out I upgraded something (probably redis) that broke symfony that broke everything.
So I had a fun afternoon upgrading to the latest mbin version. I mean I needed to anyway but my hand was forced.
Yep sometimes an innocent looking update will change your weekend plans.
Anyways, any reason not to use ssh?
The GDPR penalties are pretty serious for any reasonably large entity operating within Europe. I think when they’re actually pushed with a proper GDPR request, they will mostly comply.
And it’s risky to try to use that data. If someone, sometime in the future can prove their data was used after a confirmed GDPR request, it could be bad for them. And frankly, the number of actual GDPR requests is small enough that it’s not worth their while for such a small part of the sheer cascade of data they have.
Yes, for everyone else I don’t doubt they don’t actually delete anything.
I think it had its uses in the past, specifically if it had the memory backup to prevent full array rebuilds and cached data loss on power failure.
Also at the height of raid controller use (I would say 90s and 2000s) there probably was some compute savings by shifting the work to a dedicated controller.
In modern day, completely agree.
I feel like the only even remotely acceptable way to do this is to show the ad, prompt for the answer for 10 seconds. They can log the right/wrong answer or if the time expires the lack of one and must move on.
I can imagine that metrics showing whether your advertising is actually reaching people are valid. But to make people answer and especially make them watch more if they answer wrong is about as dystopian as it gets.
If (and I say if, I really don’t want to believe it is) that is the case, the only correct response is to uninstall Hulu immediately and put on your pirate hat.