size limit on .htaccess, robots.txt


Andilinks
08-15-2005, 05:17 PM
I have 282 pages that seem to offend the Google Search Team's new duplicate content filter, or at least that's my current theory since they insist on making it a guessing game.

I cannot just remove the bad pages because they get decent return traffic and some traffic from MSN and Yahoo which is paying the bills. So, I want to hide the pages from Gbot and there are three ways to do it, but all of them seem to entail huge lists (282 lines at least) in the .htaccess file or the robots.txt.

I'm inclined to go with the huge robots file since it is parsed far less often, but I would also like to isolate the offending pages in a separate directory and there again I'm looking at 282 lines of 301 redirect.

A mod_rewrite solution has also been presented to me but since it's difficult to test and involves even more lines than a 301 redirect I'm hesitant to use this.

I may also make the offending pages into a separate site with a new domain, but this still requires the huge 301 redirect...

May I get any suggestions or thoughts on this? This really is the worst problem I've ever had with my website. I'm looking at disposable sites and disposable domains as a possible solution to Google's banning an entire site rather than just penalizing the offending pages...

Andi
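
For a rough sense of the two robots.txt options weighed above, here is a minimal sketch. It assumes hypothetical page names (page1.htm and so on; the real 282 URLs are not listed in the thread) and uses the /s/ directory mentioned later in the thread for the "separate directory" case. It only builds the text of the file and does not touch a server.

```python
def build_robots_txt(paths, user_agent="Googlebot"):
    """Return robots.txt text that blocks the given paths for one crawler only."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


# Hypothetical page names standing in for the 282 real URLs.
pages = [f"/page{i}.htm" for i in range(1, 283)]

per_page = build_robots_txt(pages)    # one Disallow line per page
per_dir = build_robots_txt(["/s/"])   # pages moved into a single directory

print(len(per_page.splitlines()), "lines vs", len(per_dir.splitlines()), "lines")
```

Because only a Googlebot record is emitted, crawlers such as msnbot and slurp are left unrestricted, which is the point of the exercise.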

Wassercrats
08-15-2005, 05:27 PM
I have 282 pages that seem to offend the Google Search Team's new duplicate content filter, or at least that's my current theory
I was about to claim twice that many duplicate pages on my website, but I realized that they're duplicates of government webpages that have since been updated with new data, so I might have gotten lucky.

Personally, I'd try adding information to the pages, from various sources if not totally unique, rather than hiding them.

Andilinks
08-15-2005, 05:43 PM
Personally, I'd try adding information to the pages, from various sources if not totally unique, rather than hiding them.
Perhaps you don't realize that it has taken many months to develop these pages and it would take just as long to legitimately add content. Waiting that long for them to lift the ban is unrealistic.

I could change it with text scrambling software, which I have, but that would degrade the quality of the content.

You see, none of my pages is a duplicate of another on the web. I use the meta descriptions and excerpts from the sites themselves and add comments and other data.

My site is not a scraper site. I hand-edited every entry and added content to most entries, though I have used scraping software for 18 months now.

From my point of view it is Google that has gone Black Hat here. My pages are significantly superior to Google result pages. I may have gotten swept up in the scraper ban but the reality is my pages are their competition and not a dilution of their serps.

Andi

Andilinks
08-15-2005, 05:49 PM
Well, I did notice several FQ staff have viewed my initial post. I am going to go ahead with the 282 lines of 301 redirect since no one seemed to object...

johnfl68
08-15-2005, 09:37 PM
Andi:

You should also be able to use the Googlebot Meta Tag in those pages, if there turns out to be a problem with using the 301 redirects.

http://www.seoconsultants.com/meta-tags/robots/googlebot.asp

John

Andilinks
08-15-2005, 10:11 PM
This is great John, I didn't know I could specify just googlebot in the meta, this will come in handy in the future. The 301 redirects are working fine and gbot is respecting my Disallow: /s/ for the "s" directory. So I will retain this arrangement since I will redirect to a new domain as I "cleanse" the old content--an outrage!


Since Google remains silent or mysterious on just what they are banning and will ban an entire site for just a few offending pages I am going to have to experiment with multiple sites to see what works and what doesn't to stay ahead of them.

I have been very careful to stay within their guidelines for what seemed to be important factors but I think they have stabbed me in the back this time.

I have tended to side with Google when I have heard webmasters complaining about being dropped from the index for no good reason, but I am revisiting those complaints in a new light.

Their most recent content filter is stupid, reckless and vile in my case. I concede that worthless scraper sites with gobbledegook content were the target, but I am resentful and angry that I was caught in that net. No, my pages are not perfect but I have invested a lot of work into them; being banned with the scrapers is a high insult. /rant

Thanks again.

Andi

Andilinks
08-16-2005, 06:41 PM
Well, no one has bothered to answer the question posed in the thread title but I am still wondering...

Will a 2.2k robots.txt file or a 20k .htaccess file cause any problems and if so, what?

With these files on andilinks.com and a new site about to open I will have made some progress with this...

Andi

Terra
08-16-2005, 07:22 PM
Will a 2.2k robots.txt file
No, this is the most preferred way for you to solve this problem with Google...

or a 20k .htaccess file cause any problems and if so, what?
The Apache engine must read, parse, and merge every item in this file for each and every request to your site... This will hurt your overall response time...

With the 'robots.txt' solution, Google reads and _respects_ the control directives within thereby avoiding the .htaccess performance hit... Your solution should remain scoped to Google, and not everyone that will access your site...

--
Terra
--Googlebot is as Googlebot does--
FutureQuest
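
To check that a robots.txt rule really is scoped to Google and not to every visitor, a record can be tested offline with Python's standard urllib.robotparser. This is only a sketch: the /s/ path comes from this thread, while www.example.com and page1.htm are placeholders, and the parser models how a well-behaved crawler should read the file, not what any particular search engine actually does.

```python
from urllib.robotparser import RobotFileParser

# A record that names only Googlebot; other crawlers have no matching record.
robots_txt = """\
User-agent: Googlebot
Disallow: /s/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Placeholder URL inside the blocked directory.
url = "http://www.example.com/s/page1.htm"

googlebot_allowed = parser.can_fetch("Googlebot", url)
msnbot_allowed = parser.can_fetch("msnbot", url)

print("Googlebot allowed:", googlebot_allowed)
print("msnbot allowed:", msnbot_allowed)
```

Unlike .htaccess, this file is fetched only by crawlers that ask for it, so its size costs nothing on ordinary page requests.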

Andilinks
08-16-2005, 07:49 PM
This will hurt your overall response time...
Thanks for this response but I was hoping to hear a more relative answer like, "24k is WAY too big for an .htaccess file, it's huge, your site will crash" or "that's big but I've seen worse..."

Now that I've gone down that road I'll have to live with it until the SE indexes have caught up.

I'll also redirect to a new site so it will get worse before it gets better, but other than a general "will hurt your overall response time" I still have no idea how bad this is.

Andi

Wassercrats
08-16-2005, 07:59 PM
Can you do some strategic search and replaces to turn your htaccess file into a robots.txt file?
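
The search and replace suggested here could be scripted along these lines. It is a sketch under assumptions: it expects lines of the form "redirect 301 /old-path http://target", it disallows the old path (the URL being redirected), and the sample file names and www.example.com targets are placeholders, since the real .htaccess is not shown in the thread.

```python
def htaccess_to_robots(htaccess_text, user_agent="Googlebot"):
    """Turn 'redirect 301 /old http://new' lines into Disallow lines for one crawler."""
    lines = [f"User-agent: {user_agent}"]
    for raw in htaccess_text.splitlines():
        parts = raw.split()
        # Keep only 301 redirect lines; every other directive is skipped.
        if len(parts) >= 4 and parts[0].lower() == "redirect" and parts[1] == "301":
            lines.append(f"Disallow: {parts[2]}")
    return "\n".join(lines) + "\n"


# Placeholder input standing in for the real .htaccess file.
sample = """\
redirect 301 /3dg.htm http://www.example.com/s/3dg.htm
redirect 301 /abcom.htm http://www.example.com/s/abcom.htm
# unrelated directive
Options -Indexes
"""

converted = htaccess_to_robots(sample)
print(converted)
```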

Andilinks
08-16-2005, 08:06 PM
Can you do some strategic search and replaces to turn your htaccess file into a robots.txt file?
Yes, but it's already too late; msnbot, slurp, et al. have indexed the new directory.

Mostly I want to direct the women's wear pages to the new site which is not operational yet. This will get worse before it gets better, but I think I'll survive it.

Andi

Terra
08-16-2005, 08:06 PM
I still have no idea how bad this is.
Your .htaccess file is a bit piggish, and I'd like to see an alternate solution in place - however I'm not noticing any major anomalies with the SONIC server at this point...

However, that .htaccess file will rear its ugly head if someone decides to machine gun (spider) your site...

Overall, you have to know how Apache merges .htaccess statements into the current request_rec... Some of the routines are simple and quick, while others must allocate dynamic memory to store the components of your directives...

From where I sit, the worst offender in your .htaccess file is 354 'redirect 301' statements... Apache must test the request against each and every URL that you have specified... This can burn up quite a few cycles and should be avoided if possible!

Given the nature of your site, I don't see an easy way to work around it without getting into complicated DB hash driven (fast) lookup tables...

--
Terra
--index.htm == 3dg.htm = no; index.htm == abcom.htm = no; .........; index.htm == will.shtm = no; ok - serve out index.htm as is--
FutureQuest
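
The cost described above, and the hash-table alternative, can be illustrated with a small model. This is not Apache code: matching is simplified to exact string comparison, and the 354 page names are made up to match the count quoted in the post. It only shows why a request that matches none of the rules (such as /index.htm) pays for every comparison in a linear list, while a hash lookup does not scale with the number of rules.

```python
def linear_lookup(rules, path):
    """Test the path against every rule in order and count the comparisons made."""
    comparisons = 0
    for old, new in rules:
        comparisons += 1
        if path == old:
            return new, comparisons
    return None, comparisons


# Hypothetical rules: 354 old paths, each mapped to a new location under /s/.
rules = [(f"/page{i}.htm", f"/s/page{i}.htm") for i in range(354)]
table = dict(rules)  # hash-based lookup table built once

# A request for a page with no redirect still walks the whole list.
target, checks = linear_lookup(rules, "/index.htm")
print("linear scan comparisons for /index.htm:", checks)

# The dictionary answers the same question with a single hashed lookup.
print("hash lookup for /index.htm:", table.get("/index.htm"))
print("hash lookup for /page353.htm:", table.get("/page353.htm"))
```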

Andilinks
08-17-2005, 03:49 AM
Thanks Terra. Once Yahoo and MSN have the new URLs indexed I will track down other referrers and get them to change the inbound links. I can probably begin removing some 301 lines in a week or less. Maybe a meta refresh for the balance once the big SE's have updated.

Andi

georgeek
08-18-2005, 01:12 PM
I want to hide the pages from Gbot and there are three ways to do it, but all of them seem to entail huge lists (282 lines at least) in the .htaccess file or the robots.txt.
The use of .htaccess for multiple 301s should be avoided and robots.txt can be unreliable when stuffed with page level exclusions.

By far the best way to solve your problem is to use <meta name="googlebot" content="noindex, nofollow"> in the header of the pages you don't want indexed.

I have used scraping software for 18 months now.
Which software?

- George

Andilinks
08-18-2005, 01:36 PM
By far the best way to solve your problem is to use <meta name="googlebot" content="noindex, nofollow"> in the header of the pages you don't want indexed.
Yes, I'm using both this and "disallow" combined. Having used massive 301's for a day and then reversing them I'm afraid I have slurp and msnbot spinning like tops. I'll be lucky if they don't ban me too.

Which software?
PJL's Links Suite 4. It's not actual scraping software; it simply spiders for meta descriptions--which is what I think got me in trouble with G. Unfortunately Google's policy on this reversed early this year when wholesale scraper spam began to degrade search. Prior to that my excerpts/meta descriptions were welcomed by Google.

I will diversify and split my directory into many niche sites.

Thanks George.

Andi

Andilinks
09-04-2005, 05:51 PM
By far the best way to solve your problem is to use <meta name="googlebot" content="noindex, nofollow"> in the header of the pages you don't want indexed.
This unfortunately was an expensive mistake. Yahoo inexplicably obeyed the googlebot directive and deindexed all the pages with that meta. I have removed it and now Yahoo is coming back... Google is also indexing those pages but I figure it's only a matter of time before they ban them. If not I've beat this thing...

On the other hand Google may have just jumped the shark with this latest move. The only thing that's holding Google up now is the fact that all its competitors are Google wannabes.

I think there's a big opening for a "new" Google using something other than inbound links which are worthless as indicators of importance due to all the link purchase/trade...

Andi
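
For reference, the intended scoping of the googlebot meta tag can be modelled with Python's standard html.parser. This sketch assumes the usual convention that a crawler honours the generic "robots" meta plus a meta carrying its own name; "slurp" is used as the name of Yahoo's crawler as it appears in this thread. It describes what should happen, not what Yahoo actually did in the case reported above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content tokens of every named <meta> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name (lowercase) -> set of lowercase tokens

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name:
            tokens = {t.strip().lower() for t in attr.get("content", "").split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def is_noindex(html, bot_name):
    """True if the generic 'robots' meta or the bot's own meta says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(bot_name.lower(), set())
    return "noindex" in (generic | specific)


page = '<html><head><meta name="googlebot" content="noindex, nofollow"></head><body></body></html>'

print("googlebot told noindex:", is_noindex(page, "googlebot"))
print("slurp told noindex:", is_noindex(page, "slurp"))
```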

Wassercrats
09-04-2005, 06:37 PM
You can complain to Yahoo at url-support@yahoo-inc.com and you'll get the following autoresponse.

This is an AUTORESPONSE message from Yahoo!

Your letter has been sent to our Yahoo! Search & Directory Support
agents and they will be reading it soon.

Our goal is to help you find solutions to Yahoo! Search or Directory
related inquiries that are not resolved by the documentation offered in
our online help area. Sometimes, this requires extra time and effort,
so we appreciate your patience.

While you are waiting for our email response, you may want to visit our
Frequently Asked Questions section to see if your question is addressed.

Find answers to questions about how to get your site into the Yahoo!
Search Index, how being listed in the index effects search results,
content guidelines, and how our crawler reads your site see our Search
Help pages:

http://help.yahoo.com/help/us/ysearch/

Find answers to questions about how to list and find your site in the
Yahoo! Directory, how a Directory listing relates to web search, how to
request changes to your listing, see our Directory Help pages:

http://help.yahoo.com/help/us/dir/

Also, please take a moment to ensure that you have sent your inquiry to
the correct Yahoo! support department mailbox. If you realize that your
inquiry would have been better served by another support department at
Yahoo!, you may refer to the list of Help links by Feature or Service
at:

http://help.yahoo.com/

NEW SEARCH & DIRECTORY SUBMISSIONS
For information about submitting your site to Yahoo! Search or the
Yahoo! Directory please refer to How to Submit Your Site at:

http://search.yahoo.com/info/submit.html

YAHOO! SEARCH RESULTS / SITE POSITION REQUESTS:
For information on how search results are compiled, please refer to the
Search FAQs here:

http://help.yahoo.com/help/us/ysearch/basics/index.html
http://help.yahoo.com/help/us/ysear...king/index.html

Please note that site positioning is determined by a complex ranking
algorithm used to assess relevance, and we can't change the order in
which sites appear. For more information on factors that influence
search results, please see our search help section:

http://help.yahoo.com/help/us/ysear.../basics-14.html

Thank you for taking the time to read this message. We look forward to
reading your inquiry.

- The Yahoo! Search & Directory Support Team


HELP PAGE RESOURCE LIST:

Yahoo! Help Central: - http://help.yahoo.com/

Directory Help: http://help.yahoo.com/help/us/dir/

Search Help - http://help.yahoo.com/help/us/ysearch/

Yahoo! Directory Submit Help - http://help.yahoo.com/help/us/express/index.html

That email address ( url-support@yahoo-inc.com ) hasn't been available on Yahoo's website for over four years, but it still works. The "personalized" response I got to my problem was:

Hello Barry,

Thanks for writing the Yahoo! Search and Directory Support.

I understand that you have a question about why a specific page of your
site does not appear in Yahoo! Search.

Although we crawl billions of pages, we cannot guarantee that all pages
from a site will be reached or will be included in the search database.
We are always working to include more good pages, and the database is
updated daily to capture newly created and changing pages.
The goal of Yahoo! Search is to discover and index all of the content
available on the web to provide the best possible search experience to
users. The Yahoo! Search index, which contains several billion web
pages, is more than 99% populated through the free crawl process.

Yahoo! also offers several ways for content providers to submit web
pages and content directly to the Yahoo! Search index and the Yahoo!
Directory.

Yahoo! Search Submission

Submit Your Site for Free:
- Suggest your site for inclusion in Yahoo! Search Index (requires
registration).

Yahoo! Search Submit and Search Submit Express:
- Ensure that your web pages are included in the Yahoo! Search index.
- Guaranteed quality review by editors for relevance.
- Not intended for placement or ranking in search results.

http://searchmarketing.yahoo.com/srchsb/index.php

Pay-For-Performance:
- List your business in sponsored search results across the Web.
- Control your position by the amount you bid on keywords.
- Set your own price-per-click and pay only when a customer clicks
through to your site.

http://searchmarketing.yahoo.com/srch/index.php

Yahoo! Directory Submission

Yahoo! Directory Submit:
- Submit your site for review and inclusion in the Yahoo! Directory.
- Editorial review of your pages.
- Intuitive categorization of your site in the Yahoo! Directory.

http://searchmarketing.yahoo.com/dirsb/index.php

Yahoo Directory Standard:
- Submit your non-commercial site for free.

For more information please see Yahoo Submissions at:

http://search.yahoo.com/info/submit.html

For answers to other questions you may have regarding Yahoo! Search,
please see:
http://help.yahoo.com/help/us/ysearch/

For answers to other questions you may have regarding the Yahoo!
Directory, please see:
http://help.yahoo.com/help/us/dir/

But the webpage I was complaining about was indexed the next day. I think Google is better technically and socially than Yahoo.

Andilinks
09-04-2005, 06:55 PM
I think Google is better technically and socially than Yahoo.
That's probably true, but Y is an older if not larger company. For now Yahoo seems to be doing a good job of reindexing andifashion.com ... For all the 301 and meta twists and turns that I've given them I can't complain.

I'm looking for a new broom that sweeps clean. Google still understands its original mission but it is becoming ossified, imho.

Andi