file locking and fcntl


sheila
08-29-2001, 02:37 AM
I'm working on some file locking techniques that I will be using in some scripts I want to write. I'm trying to make the durn thing cross-platform (Windows/Unix). The topic of this message deals with blocking vs. non-blocking calls to get a file lock.

The Windows commands I'm using do not offer blocking. The command either succeeds or fails. I have to write into my script, to try multiple times over some time interval. If still unsuccessful at the end of the time interval, the file-lock attempt will time out.

On Unix, I can try for blocking or non-blocking file locks. Of course, blocking saves me the trouble of having to write re-try attempts into my code.

But, I have a concern. Do I need to be worried, in a CGI script situation, about what will happen if a call to acquire a blocking lock is never returned? (I know that CGI scripts will time out eventually anyway. How long until the CGI scripts time out on the FutureQuest servers?)

I'm half inclined to do non-blocking calls on the FutureQuest servers, and have my script handle the retries. Am I being silly or overly paranoid?
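A minimal sketch of the retry-until-timeout approach described in this post, using only Python's standard fcntl module on the Unix side. The name `lock_with_retry` and the `timeout` and `interval` defaults are illustrative assumptions; they do not come from the scripts discussed in this thread.

```python
import fcntl
import time


def lock_with_retry(fileobj, timeout=5.0, interval=0.1):
    """Try a non-blocking exclusive lock until it succeeds or `timeout` elapses.

    Returns True if the lock was acquired, False if the attempt timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # LOCK_NB makes flock() fail immediately instead of waiting.
            fcntl.flock(fileobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            # Another holder has the lock; give up once the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

The caller decides what a timeout means (report an error, skip the write, and so on); the helper only reports whether the lock was obtained.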

Rich
08-30-2001, 03:25 PM
The Windows commands I'm using do not offer blocking.
A lot of how you handle flock on Windows will probably depend on which version of Windows you are targeting. I don't develop for the Windows platform, but it was my belief (maybe false) that the flock libraries for NT or greater supported blocking.

Do I need to be worried, in a CGI script situation, about what will happen if a call to acquire a blocking lock is never returned?
Of course, the answer here is 'yes'. :) But, whether you have a blocking lock that never returns or a non-blocking loop that fails, you have the same problem--the inability to perform the write that you wanted to perform.

How you handle the situation depends on the source of the failure to lock. If it's your application that controls access to the file, then simply using the blocking call is sufficient (assuming you have made sure your application is robust and properly locks/unlocks the files.)

If the source of the locks is outside your app's control, then you'll have to design some strategy that your app and the other sources can live with. Under the worst-case scenario, you can always force an unlock or close of the file before processing it.

One approach to avoid all the looping logic is to:

(1) Try locking using LOCK_SH | LOCK_NB. Since this should succeed, you can error out if an error is encountered.

(2) Follow the above with your LOCK_EX blocking lock which should succeed.

(I know that CGI scripts will time out eventually anyway. How long until the CGI scripts time out on the FutureQuest servers?)
One of the Linux Admins may better answer this question, but I do not believe the script will time out. If a blocking lock fails, I believe this results in a dangling child process that is waiting at that point.

sheila
08-30-2001, 05:13 PM
Originally posted by Rich:

A lot of how you handle flock on Windows will probably depend on which version of Windows you are targeting. I don't develop for the Windows platform, but it was my belief (maybe false) that the flock libraries for NT or greater supported blocking.
Yes, there is a difference whether you are targeting Win9x or WinNT/2000. There are quite a few more options for the latter. But I want to be able to run my scripts on my Win98 machine for testing, so that sort of limits me. (Which is why I said, "The Windows commands I'm using do not offer blocking." I'm sort of limited by the Win9x thing.)

My solution to this problem targets the win32 API, and I chose the functions so that it should work with win9x/WinNT/Win2000. (So far I've tested it on my Win98 machine. I could go test it on my children's Win2K machine, but I haven't gotten that excited about it, yet.)

...<snipped>...
How you handle the situation depends on the source of the failure to lock. If it's your application that controls access to the file, then simply using the blocking call is sufficient (assuming you have made sure your application is robust and properly locks/unlocks the files.)

If the source of the locks is outside your app's control, ...<snipped>...
Yes, this goes somewhere along the lines of the song, "I have a lot to learn...". For this particular stuff that I'm working on right now, it is only my own app that would be looking to access these files, so that makes it a bit simpler.

One approach to avoid all the looping logic is to:

(1) Try locking using LOCK_SH | LOCK_NB. Since this should succeed, you can error out if an error is encountered.

(2) Follow the above with your LOCK_EX blocking lock which should succeed.
I don't see how LOCK_SH | LOCK_NB is guaranteed to succeed? What if another process is holding an exclusive lock on the file? Then it will fail.

FWIW, I also asked some of these same questions on the Python Tutor List (http://mail.python.org/mailman/listinfo/tutor) and after reading those responses, I've decided to use non-blocking calls with re-tries in my script.

For anyone who is interested, here is the Python code that I came up with:
The wrapper Parent class (MutexFile.py) (http://www.thinkspot.net/sheila/computers/python/MutexFile.py)
The posix platform specific version. (http://www.thinkspot.net/sheila/computers/python/posixMutexFile.py)
The windows specific class. (http://www.thinkspot.net/sheila/computers/python/winMutexFile.py)
I got mentioned on the Dr. Dobb's weekly Python URL (http://groups.google.com/groups?as_umsgid=87629BC8980B9852.D7F7644DBC6F1DC6.2A794D1EDE2C871A%40lp.airnews.net) for working on this problem. My links above are the latest revision. (That message about the Dr. Dobb's Python URL isn't available quite yet on groups.google, but maybe later today. It basically gives a link to this thread: http://groups.google.com/groups?th=f27682d95a0fa6c1 )

One of the Linux Admins may better answer this question, but I do not believe the script will time out. If a blocking lock fails, I believe this results in a dangling child process that is waiting at that point.
Well, I'm definitely interested to hear the final word on that matter. In any case, I'm not using blocking locks, but I'm still interested in what happens when one fails (i.e. never returns). I thought that the cgi script processes eventually timed out?

Bruce
08-30-2001, 07:43 PM
To answer some of the questions:

There is a CPU time limit (20 seconds) but no wall time limit on CGIs. If your script has any chance of hanging on some external event (like a socket, a lock, etc), you are strongly encouraged to stick in a call to alarm to make sure that it cleans up eventually.

If your only locking options on Windows are non-blocking, then you would be best to use non-blocking locking on all platforms. Just remember to sleep for a second between polling the lock.

It's also worth noting that file locks in UNIX are automatically dropped as soon as the process that obtained the lock exits.
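One way the alarm suggestion above could look in Python on Unix, as a hedged sketch: a blocking flock guarded by SIGALRM so the wait cannot last forever. The names `LockTimeout` and `lock_blocking_with_alarm` and the 10-second default are assumptions made for illustration. Signal handlers can only be installed from the main thread, which is the normal case for a CGI script.

```python
import fcntl
import signal


class LockTimeout(Exception):
    """Raised when the blocking lock is not granted before the alarm fires."""


def _on_alarm(signum, frame):
    raise LockTimeout("gave up waiting for the lock")


def lock_blocking_with_alarm(fileobj, seconds=10):
    """Take a blocking exclusive lock, but let SIGALRM interrupt a long wait."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # schedule SIGALRM after `seconds` of wall time
    try:
        fcntl.flock(fileobj, fcntl.LOCK_EX)  # sleeps until the lock is free
    finally:
        signal.alarm(0)  # cancel the alarm if it has not fired yet
        signal.signal(signal.SIGALRM, previous)
```

While the process sleeps inside flock it uses essentially no CPU, so a CPU time limit alone would never end it; the alarm is what bounds the wait.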

sheila
08-30-2001, 08:32 PM
Originally posted by Bruce:
There is a CPU time limit (20 seconds) but no wall time limit on CGIs.
...<snipped>...
Just remember to sleep for a second between polling the lock.

What is a wall time limit?
(Don't you just love, how answering questions generates more questions?)

Did you mean "sleep for a second" in a literal sense, as in one-sixtieth of a minute? That's an awful long time in cgi-execution time parameters. I asked about how long to let my script sleep between retries, and someone said that sleeping for 0.1 of a second was more than ample.

Thanks for the answers!

Rich
08-30-2001, 08:58 PM
I don't see how LOCK_SH | LOCK_NB is guaranteed to succeed? What if another process is holding an exclusive lock on the file? Then it will fail.
Actually, I said this 'should' succeed (if there is no exclusive lock). However, you are correct, this does not help circumvent an exclusive lock that never frees. It does allow you to know that you are going to have to wait for a lock.

I've always found file locking mechanisms tricky to understand (along with some speaking in forked pipes). I have learned to use locking successfully in production environments, but I always have to review the concepts carefully before using new techniques.

Having said that, and having reviewed all my notes, I should point out the other errors in my original response. :\

If you are using an advisory function like flock, then trying to get a lock in the presence of outside applications (those that might not correctly implement flock) is a waste of time. flock does not report anything about the open/close state of a file--only the state of other flock issued commands.

So, now that we are only concerned about our own application(s)...

The important point about using flock is that you should NEVER have an exclusive lock that doesn't get closed. If you do then you have a severe error in your application.

It's important to remember that the purpose of the locking mechanisms is to maintain file integrity during race conditions. Locking techniques should not be used to "work-around" locking problems. And, for that matter, neither should any routine.

<edit>
It took me a long time (and much pain) to learn the above lesson. But once I did, I started using flock as it was intended (in blocking mode) and applications that were previously error-prone started performing well even under heavy use.
</edit>

If you use a non-blocking, looping algorithm, on a system that supports blocking, then you are only detecting an error in your program and reporting it to the user. This is o.k. in testing phase, but is a warning sign for any production-intent code.

Rich

Bruce
08-31-2001, 01:32 AM
What is a wall time limit?
Wall time (as in, a wall clock) refers to real time as opposed to CPU time. The term "wall time" is used because "real time" has other vastly different connotations or implications. A wall time limit, then, is a limit on the maximum amount of time a process is allowed to hang around, and there is no such limit.

Did you mean "sleep for a second" in a literal sense, as in one-sixtieth of a minute? That's an awful long time in cgi-execution time parameters. I asked about how long to let my script sleep between retries, and someone said that sleeping for 0.1 of a second was more than ample.
It really depends what the other process is going to do in that time. What you are doing is polling, and polling is almost always bad for two reasons: it is more CPU intensive and it doesn't scale. If it's a quick procedure (short time the lock is held), then polling rapidly is fine. If the lock may be held for long periods, then it's bad because of increased CPU usage. Similarly, if you have high contention for the lock, polling will cause the excluded processes to pile up and increase the time the lock is held as a side effect.

So, to summarize, if you expect to have minimal lock contention and it's a fast procedure that is being locked (no network I/O etc), then 100ms is probably adequate.
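The wall time versus CPU time distinction explained above can be seen directly with two clocks from Python's standard time module. This is only an illustration; the helper name `measure` is made up for the example.

```python
import time


def measure(func):
    """Return (wall_seconds, cpu_seconds) spent running func()."""
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    func()
    return time.monotonic() - wall_start, time.process_time() - cpu_start


# Sleeping (like waiting on a lock) advances wall time but uses almost no CPU.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

A process blocked on a lock behaves like the sleep here: wall time keeps growing while CPU time barely moves, which is why a CPU time limit never catches a hung wait.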

sheila
09-06-2001, 12:15 AM
Okay, I want to revisit this topic, again...(I hope that's OK?)

Rich wrote:

It's important to remember that the purpose of the locking mechanisms is to maintain file integrity during race conditions. Locking techniques should not be used to "work-around" locking problems. And, for that matter, neither should any routine.

OK, I gather that "race conditions" would be a situation where more than one process might be trying to access the file in a short time (several processes all grabbing for it), yes? I understand that lock techniques are meant to preserve the integrity of the file (this is the precise reason I'm thinking I need them). I'm having a hard time imagining how someone would use them as a "work around", and what other type of locking problems there are besides trying to maintain file integrity? (such that one would want to use this as a possible "work around")

Also from Rich:

If you use a non-blocking, looping algorithm, on a system that supports blocking, then you are only detecting an error in your program and reporting it to the user. This is o.k. in testing phase, but is a warning sign for any production-intent code.

I guess I'm really confused, here. Both you and Bruce are telling me that blocking locks are preferable. I can easily buy into the idea that Bruce states: blocking locks are less CPU intensive. OK. Important point. I didn't realize that, before. Bruce indicates that locking for a long time can be a problem, especially if one is polling for the lock. I can see that, too. But I only intend to hold the lock very briefly, so it shouldn't be a problem.

I guess I wouldn't mind an example of how polling could go awry in production vs. testing code, if that's possible?

Also, now that Bruce has explained "wall time", I see that, indeed, a process can 'hang', in theory. Since it wouldn't be using any more CPU time, then the process wouldn't be terminated. I think that's what he's telling me.

Rich
09-06-2001, 12:57 AM
OK, I gather that "race conditions" would be a situation where more than one process might be trying to access the file in a short time (several processes all grabbing for it), yes?
Yes. However, you must remember that these "other processes" are the other instances of your application. (You could include in this 'other processes' list other applications--that you didn't write--that also use a robust implementation of an advisory locking method. However, this situation is rare.)

what other type of locking problems there are besides trying to maintain file integrity?
The "other" problems are all the reasons that someone modified the basic locking methods, i.e.

(here I'm talking about only systems that offer blocked calls. I'm not talking about Win95, etc.)

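use Fcntl qw(:flock :seek);
open(FH, "+<", "somefile") or die "open: $!";
flock(FH, LOCK_EX) or die "flock: $!";  # blocks until the lock is granted
seek(FH, 0, SEEK_SET);                  # or wherever you need to seek to
# do i/o stuff
close(FH);                              # closing the handle releases the lock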

My contention is that these "other" problems do not exist. However, if you change the above basic algorithm, you must ask yourself 'why am I changing this?' What are the problems you are solving with the change (like looping, etc.)?

My assertion is: if you change the above basic algorithm, then:

(1) You are trying to correct a bug somewhere else in your program, or

(2) You are trying to work around the "problem" that some other, external application might have the file open. (hint: YOU CANNOT WRITE ANY ADVISORY LOCKING ALGORITHM THAT WILL CORRECT THIS PROBLEM.)
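For readers following along in Python (the language of the scripts linked earlier in the thread), the basic open / blocking lock / seek / I/O / close pattern above might be sketched as follows. The counter file and the name `increment_counter` are hypothetical, chosen only to give the locked section something to do. Unlike the "+<" open mode, this sketch also creates the file if it is missing.

```python
import fcntl
import os


def increment_counter(path):
    """Read-modify-write a counter file under a blocking exclusive lock."""
    # Open for read/write, creating the file if needed, without truncating it.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    # Leaving the with-block closes the file, which also releases the lock.
    with os.fdopen(fd, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # waits here until the lock is free
        text = fh.read().strip()
        value = int(text) + 1 if text else 1
        fh.seek(0)
        fh.truncate()
        fh.write(str(value))
        fh.flush()
        return value
```

There is no loop and no retry logic: every instance of the application simply waits its turn.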

I guess I wouldn't mind an example of how polling could go awry in production vs. testing code, if that's possible?
[this is the point that took me a long time to learn]
If you are polling, then why are you polling? Polling on an OS that does not support blocking makes sense and is a good design decision. However, if the OS supports blocking and you are polling to overcome a locking "problem" that blocking introduces, then you have an error in your application. To put a logic error into production is always a mistake. :)

Rich

sheila
09-06-2001, 01:26 AM
LOL. Ah, Rich. Thanks. You make me feel like I just had a good talking-to by my thesis advisor. (Sometimes, I know he just has to be asking himself, why I can't get the simple ideas through my thick head.)

I really don't have a "problem" to correct by doing a loop on non-blocking calls on the system that supports locking. (And, yes, I fully realize that the other processes are simply other instances of my application.)

Someone had suggested to me (or did I suggest it to myself??? I no longer recall...) that I implement the same way of dealing with things on both platforms. This is the "problem" I was trying to correct, I guess. Ewww! Trying to correct a *nix system to work like Windows. :BPG:
I am truly ashamed. I will now go sit in the corner.

~#

Rich
09-06-2001, 01:52 AM
...implement the same way of dealing with things on both platforms.
Yes, code portability would be a valid reason for doing this. The situation you originally stated (a personal application that might be used on more than one platform) is a special case and would warrant this.

I'm not sure, though, that I would want to support an application on a public Web site where the OS does not support blocking i/o locks.

Rich

Bruce
09-07-2001, 01:53 AM
OK, I gather that "race conditions" would be a situation where more than one process might be trying to access the file in a short time (several processes all grabbing for it), yes?
A race condition is where the outcome of a computation is indeterminate due to the potential for one process affecting the other process' data. Any non-mutex-protected access to a shared resource (a file in your case) produces a race condition unless you are extremely careful.
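To make that definition concrete, here is a small deterministic simulation of one bad interleaving, the classic "lost update". The two "processes" A and B are just sequential steps in one script, and the counter file is a hypothetical example; in a real race the ordering would vary from run to run.

```python
import os
import tempfile


def read_count(path):
    with open(path) as fh:
        return int(fh.read())


def write_count(path, value):
    with open(path, "w") as fh:
        fh.write(str(value))


path = os.path.join(tempfile.mkdtemp(), "counter")
write_count(path, 0)

# Two unlocked "processes" A and B each want to add 1 to the counter.
a_seen = read_count(path)      # A reads 0
b_seen = read_count(path)      # B reads 0 before A has written
write_count(path, a_seen + 1)  # A writes 1
write_count(path, b_seen + 1)  # B also writes 1, overwriting A's update

print(read_count(path))        # prints 1: one increment was lost
```

With an exclusive lock held across each read-and-write pair, B could not read until A had written, and the result would be 2.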

I'm having a hard time imagining how someone would use them as a "work around"

Here's one way they could be used as a "work around".

The (old) traditional UNIX mailbox format is called "mbox" -- each user has one big file, and the start of each message is marked by the word "From " at the start of a line. In order to modify the mailbox in any way (including appending messages to it), you *must* lock it to prevent another accessor from either corrupting the data or even from reading partially-written data.

One of the new mailbox formats in use, and the format that FutureQuest uses, stores each message in a separate file, and uses a three-stage delivery process (carefully create a uniquely named file in a temporary directory, write the data to that temp file and sync it to disk, and *then* move it into the visible "new" directory). Once written the files are never modified. Zero locks. Zero potential for corruption or lost data.
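The three-stage delivery described above could be sketched roughly like this in Python. This is not the actual delivery code of any mail system: the function name `deliver`, the file-naming scheme, and the assumption that `tmp` and `new` subdirectories already exist on the same filesystem are all illustrative.

```python
import os
import socket
import time


def deliver(maildir, data):
    """Write `data` (bytes) as a new message via tmp/ and then new/."""
    # Hypothetical unique name built from time, pid and hostname.
    name = f"{time.time_ns()}.{os.getpid()}.{socket.gethostname()}"
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # Stage 1: create the file, failing if the name somehow already exists.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        # Stage 2: write the whole message and force it to disk.
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
    except BaseException:
        os.unlink(tmp_path)
        raise
    # Stage 3: only now make the complete file visible to readers.
    os.rename(tmp_path, new_path)
    return new_path
```

Readers only ever look in `new`, so they can never see a half-written message, and no lock is taken at any point.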

what other type of locking problems there are besides trying to maintain file integrity?
In a web application, that would be the primary one, but there is also read integrity.

To guarantee read integrity, and potentially eliminate a lock, write data to a temporary file and only make it visible or permanent once the data is completely written, and then rename it (which atomically replaces the original file). Another technique for eliminating locks is to combine the above technique with storing each indivisible unit of data (in the mail example above, one email message) in a separate file. That way you don't need to acquire a lock to add or remove pieces of data, and in some cases to modify the data.
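A minimal Python sketch of the write-to-temp-then-rename technique, assuming a POSIX filesystem where rename within one directory is atomic. The name `atomic_write` and the ".tmp-" prefix are made up for the example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes) in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This protects readers, but note that two writers doing read-modify-write cycles can still overwrite each other's changes; that case still needs a lock or the one-file-per-unit layout described above.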

I guess I wouldn't mind an example of how polling could go awry in production vs. testing code, if that's possible?
Your production app sees more users than you expected. Suddenly there are many users trying to access the lock, and so much CPU time is being used to poll the lock that new accessors are added faster than the lock is released. At this point, Terra comes in and your application mysteriously disappears. ~#

That's pretty catastrophic a scenario, but imagine this: You set your polling interval to one second. Two tasks try to acquire the lock simultaneously, and one fails and sleeps for a second. While that one is sleeping, the first task releases its lock and another task comes in and grabs the lock just before the second is up. It tries again, and sleeps some more. Very low load, but the application is starved.

Terra
09-07-2001, 02:30 AM
At this point, Terra comes in and your application mysteriously disappears.
So this is not taken the wrong way, I will put forth comment...

I will not delete anyone's application for any reason, even if it is overloading the server...

I will either:
a) deactivate the program from being executed
or if (a) won't work
b) deactivate the entire account

If it is a MySQL overload, then I will revoke MySQL privileges for that account...

In almost all cases, I will issue a 'TOS Violation' providing proof for my deactivation action... I rarely ever pull the plug on something, unless I can formulate a clear picture of what is going on... It avoids conflict with the site owner when I have proven my case beyond the shadow of a doubt... Not many hosts will go to this length, however - I do...

In all cases, I do approach each situation as unique and gauge the severity of the problem before responding/reacting...

The only real time I will remove something from the server is when it involves cracking tools as that poses a very serious threat - and one which I will not tolerate... I also will not tolerate eggdrop bots (or their family of programs like bnc), that will also pull an instant deactivation...

One more instance of instant deactivation is Spamming - for which I have little tolerance... More concerning this will be coming soon in a separate announcement...

--
Terra
sysAdmin
FutureQuest